[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-11-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #37 from Kewen Lin  ---
The degradation should be gone on Power now.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-11-02 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #36 from CVS Commits  ---
The master branch has been updated by Kewen Lin :

https://gcc.gnu.org/g:f5e18dd9c7dacc9671044fc669bd5c1b26b6bdba

commit r11-4637-gf5e18dd9c7dacc9671044fc669bd5c1b26b6bdba
Author: Kewen Lin 
Date:   Tue Nov 3 02:51:47 2020 +

pass: Run cleanup passes before SLP [PR96789]

As discussed in PR96789, some scalar stmts can be eliminated by
passes that run after SLP, but we still model their costs when
trying to SLP, which can impact the vectorizer's decision.  One
typical case is the case in PR96789 on target Power.

As Richard suggested there, this patch introduces one pass
called pre_slp_scalar_cleanup which contains some secondary
cleanup passes, for now FRE and DSE.  It also introduces one new
group of TODO flags called pending TODO flags: unlike normal
TODO flags, pending TODO flags are passed down the pipeline
until one of their consumers can perform the requested action.
Consumers should then clear the flags for the actions that they
have taken.

Some compilation time statistics on all SPEC2017 INT benchmarks
were collected on one Power9 machine for several option sets below:
  A1: -Ofast -funroll-loops
  A2: -O1
  A3: -O1 -funroll-loops
  A4: -O2
  A5: -O2 -funroll-loops

the corresponding increment rate is trivial:
  A1       A2       A3       A4       A5
  0.08%    0.00%    -0.38%   -0.10%   -0.05%

Bootstrapped/regtested on powerpc64le-linux-gnu P8.

gcc/ChangeLog:

PR tree-optimization/96789
* function.h (struct function): New member unsigned pending_TODOs.
* passes.c (class pass_pre_slp_scalar_cleanup): New class.
(make_pass_pre_slp_scalar_cleanup): New function.
(pass_data_pre_slp_scalar_cleanup): New pass data.
* passes.def: (pass_pre_slp_scalar_cleanup): New pass, add
pass_fre and pass_dse as its children.
* timevar.def (TV_SCALAR_CLEANUP): New timevar.
* tree-pass.h (PENDING_TODO_force_next_scalar_cleanup): New
pending TODO flag.
(make_pass_pre_slp_scalar_cleanup): New declaration.
* tree-ssa-loop-ivcanon.c (tree_unroll_loops_completely_1):
Once any outermost loop gets unrolled, flag cfun pending_TODOs
PENDING_TODO_force_next_scalar_cleanup on.

gcc/testsuite/ChangeLog:

PR tree-optimization/96789
* gcc.dg/tree-ssa/ssa-dse-28.c: Adjust.
* gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
* gcc.dg/vect/bb-slp-41.c: Likewise.
* gcc.dg/tree-ssa/pr96789.c: New test.
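
For reference, a minimal sketch of the pending-TODO flow the commit message
describes (the field and flag names follow the ChangeLog above; the
surrounding conditions and method bodies are simplified and partly assumed):

---
/* Producer side (tree-ssa-loop-ivcanon.c, simplified): once any
   outermost loop gets unrolled, request a scalar cleanup downstream.  */
if (changed && !loop_outer (loop_father))
  cfun->pending_TODOs |= PENDING_TODO_force_next_scalar_cleanup;

/* Consumer side (passes.c, simplified): the container pass only runs
   when the pending flag is set, and clears it so the request is
   honored exactly once.  */
bool
pass_pre_slp_scalar_cleanup::gate (function *fun)
{
  return (fun->pending_TODOs & PENDING_TODO_force_next_scalar_cleanup) != 0;
}

unsigned int
pass_pre_slp_scalar_cleanup::execute (function *fun)
{
  fun->pending_TODOs &= ~PENDING_TODO_force_next_scalar_cleanup;
  return 0;
}
---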

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

rsandifo at gcc dot gnu.org  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #35 from rsandifo at gcc dot gnu.org  ---
(In reply to rguent...@suse.de from comment #24)
> On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> >
> >--- Comment #22 from Hongtao.liu  ---
> >> One of my workmates found that if we disable vectorization for SPEC2017
> >> 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> >> explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> >> it can speed up by 4%.
> >
> >For CLX, if we disable slp vectorization in sub4x4_dct by 
> >__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
> >up by 4%.
> >
> >> Thanks Richi! Should we take care of this case? or neglect this kind of
> >> extension as "no instruction"? I was intent to handle it in target specific
> >> code, but it isn't recorded into cost vector while it seems too heavy to do
> >> the bb_info slp_instances revisits in finish_cost.
> >
> >For i386 backend unsigned char --> unsigned short is not "no instruction",
> >but in this case
> >---
> >1033  _134 = MEM[(pixel *)pix1_295 + 2B];  
> >
> >1034  _135 = (short unsigned int) _134;
> >---
> >
> >It could be combined and optimized to 
> >---
> >movzbl  19(%rcx), %r8d
> >---
> >
> >So, if the "unsigned char" variable is loaded from memory, then the
> >conversion would also be "no instruction"; I'm not sure if the backend
> >cost model could handle such a situation.
> 
> I think all attempts to address this from the side of the scalar cost are
> going to be difficult and fragile.
Agreed FWIW.  Even in rtl, the kinds of conversion we're talking
about could be removed, such as by proving that the upper bits are
already correct, by combining the extension with other instructions
so that it becomes “free” again, or by ree.  Proving that the upper
bits are already correct isn't uncommon: gimple has to make a choice
between signed and unsigned types even if both choices would be
correct, whereas rtl is sign-agnostic for storage.

So it's not obvious to me that trying to model things at this level is
going to be right more often than it's wrong.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #34 from Hongtao.liu  ---
(In reply to Kewen Lin from comment #29)
> (In reply to Hongtao.liu from comment #28)
> > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > 
> > Yes, it's the place.
> > 
> > > is UB to UH conversion statement, further check if the def of the input UB
> > > is MEM.
> > 
> > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > guess later optimizations for the purpose of more accurate vectorization
> > cost model, :(.
> 
> Yeah, it's hard sadly. The generic cost modeling is rough,
> ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> :)), if you want to check it more, it seems doable in target specific hook
> finish_cost where you can get the whole vinfo object, but it could end up
> with very heavy analysis and might not be worthy.
> 
> Do you mind to check if it can also fix this degradation on x86 to run FRE
> and DSE just after cunroll? I found it worked for Power, hoped it can help
> there too.

No, it's not working for CLX; the problem in the i386 backend is a bit
different.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #33 from Richard Biener  ---
(In reply to Kewen Lin from comment #32)
> (In reply to Richard Biener from comment #31)
> > (In reply to Kewen Lin from comment #29)
> > > (In reply to Hongtao.liu from comment #28)
> > > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the 
> > > > > statement
> > > > 
> > > > Yes, it's the place.
> > > > 
> > > > > is UB to UH conversion statement, further check if the def of the 
> > > > > input UB
> > > > > is MEM.
> > > > 
> > > > Only if there's no multi-use for UB. More generally, it's quite 
> > > > difficult to
> > > > guess later optimizations for the purpose of more accurate vectorization
> > > > cost model, :(.
> > > 
> > > Yeah, it's hard sadly. The generic cost modeling is rough,
> > > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > > :)), if you want to check it more, it seems doable in target specific hook
> > > finish_cost where you can get the whole vinfo object, but it could end up
> > > with very heavy analysis and might not be worthy.
> > > 
> > > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > > there too.
> > 
> > Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> > returned from passes that see cleanup opportunities and have the pass
> > manager queue that up, looking for a special marked pass and enabling
> > that so we could have
> > 
> >   NEXT_PASS (pass_predcom);
> >   NEXT_PASS (pass_complete_unroll);
> >   NEXT_PASS (pass_scalar_cleanup);
> >   PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
> > NEXT_PASS (pass_fre, false /* may_iterate */);
> > NEXT_PASS (pass_dse);
> >   POP_INSERT_PASSES ();
> > 
> > with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> > pass properties would match this better, or sth else.
> > 
> 
> Thanks for the suggestion! Before cooking the patch, I have one question
> that it looks to only update function property is enough, eg: some pass sets
> property PROP_ok_for_cleanup and later pass_scalar_cleanup only goes for the
> func with this property (checking in gate), I'm not quite sure the reason
> for the TODO_flag TODO_force_next_scalar_cleanup.

properties are not an easy fit since they are static in the pass
description while we want to trigger the cleanup only if we unrolled
an outermost loop for example.  Returning TODO_force_next_scalar_cleanup
from cunroll is the natural way to signal this.

The hackish way then would be to "queue" this TODO_force_next_scalar_cleanup
inside the pass manager itself until it runs into a
"scalar cleanup pass" (either somehow marked or just hard-matched), forcing
its gate() to evaluate to true (just skipping its evaluation).

As said, there's no "nice" way for the information flow at the moment.

One could expose the "pending TODO" (TODO not handled by the pass manager
itself) in a global variable (like current_pass) so the cleanup pass
gate() could check

  gate () { return pending_todo & TODO_force_next_scalar_cleanup; }

and in its execute clear this bit from pending_todo.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #32 from Kewen Lin  ---
(In reply to Richard Biener from comment #31)
> (In reply to Kewen Lin from comment #29)
> > (In reply to Hongtao.liu from comment #28)
> > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the 
> > > > statement
> > > 
> > > Yes, it's the place.
> > > 
> > > > is UB to UH conversion statement, further check if the def of the input 
> > > > UB
> > > > is MEM.
> > > 
> > > Only if there's no multi-use for UB. More generally, it's quite difficult 
> > > to
> > > guess later optimizations for the purpose of more accurate vectorization
> > > cost model, :(.
> > 
> > Yeah, it's hard sadly. The generic cost modeling is rough,
> > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > :)), if you want to check it more, it seems doable in target specific hook
> > finish_cost where you can get the whole vinfo object, but it could end up
> > with very heavy analysis and might not be worthy.
> > 
> > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > there too.
> 
> Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> returned from passes that see cleanup opportunities and have the pass
> manager queue that up, looking for a special marked pass and enabling
> that so we could have
> 
>   NEXT_PASS (pass_predcom);
>   NEXT_PASS (pass_complete_unroll);
>   NEXT_PASS (pass_scalar_cleanup);
>   PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
> NEXT_PASS (pass_fre, false /* may_iterate */);
> NEXT_PASS (pass_dse);
>   POP_INSERT_PASSES ();
> 
> with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> pass properties would match this better, or sth else.
> 

Thanks for the suggestion! Before cooking the patch, I have one question: it
looks like only updating a function property would be enough, e.g. some pass
sets a property PROP_ok_for_cleanup and the later pass_scalar_cleanup only
runs for functions with this property (checked in its gate). I'm not quite
sure about the reason for the TODO flag TODO_force_next_scalar_cleanup.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #31 from Richard Biener  ---
(In reply to Kewen Lin from comment #29)
> (In reply to Hongtao.liu from comment #28)
> > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > 
> > Yes, it's the place.
> > 
> > > is UB to UH conversion statement, further check if the def of the input UB
> > > is MEM.
> > 
> > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > guess later optimizations for the purpose of more accurate vectorization
> > cost model, :(.
> 
> Yeah, it's hard sadly. The generic cost modeling is rough,
> ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> :)), if you want to check it more, it seems doable in target specific hook
> finish_cost where you can get the whole vinfo object, but it could end up
> with very heavy analysis and might not be worthy.
> 
> Do you mind to check if it can also fix this degradation on x86 to run FRE
> and DSE just after cunroll? I found it worked for Power, hoped it can help
> there too.

Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
returned from passes that see cleanup opportunities and have the pass
manager queue that up, looking for a special marked pass and enabling
that so we could have

  NEXT_PASS (pass_predcom);
  NEXT_PASS (pass_complete_unroll);
  NEXT_PASS (pass_scalar_cleanup);
  PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
NEXT_PASS (pass_fre, false /* may_iterate */);
NEXT_PASS (pass_dse);
  POP_INSERT_PASSES ();

with pass_scalar_cleanup gate() returning false otherwise.  Eventually
pass properties would match this better, or sth else.

That said, running a cleanup on the whole function should be done via
a separate pass - running a cleanup on a sub-CFG can be done from
within another pass.  But mind that sub-CFG cleanup really has to be
of O(size-of-sub-CFG), otherwise it doesn't help.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #30 from Richard Biener  ---
(In reply to Hongtao.liu from comment #23)
> >  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> > _576, _125, _143, _161, _179}; 
> 
> The cost of vec_construct in i386 backend is 64, calculated as 16 x 4
> 
> cut from i386.c
> ---
> /* N element inserts into SSE vectors.  */ 
> int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> ---
> 
> From the perspective of pipeline latency it seems ok, but from the
> perspective of rtx_cost it seems inaccurate, since it would be initialized as
> ---
> vmovd   %eax, %xmm0
> vpinsrb $1, 1(%rsi), %xmm0, %xmm0
> vmovd   %eax, %xmm7
> vpinsrb $1, 3(%rsi), %xmm7, %xmm7
> vmovd   %eax, %xmm3
> vpinsrb $1, 17(%rsi), %xmm3, %xmm3
> vmovd   %eax, %xmm6
> vpinsrb $1, 19(%rsi), %xmm6, %xmm6
> vmovd   %eax, %xmm1
> vpinsrb $1, 33(%rsi), %xmm1, %xmm1
> vmovd   %eax, %xmm5
> vpinsrb $1, 35(%rsi), %xmm5, %xmm5
> vmovd   %eax, %xmm2
> vpinsrb $1, 49(%rsi), %xmm2, %xmm2
> vmovd   %eax, %xmm4
> vpinsrb $1, 51(%rsi), %xmm4, %xmm4
> vpunpcklwd  %xmm6, %xmm3, %xmm3
> vpunpcklwd  %xmm4, %xmm2, %xmm2
> vpunpcklwd  %xmm7, %xmm0, %xmm0
> vpunpcklwd  %xmm5, %xmm1, %xmm1
> vpunpckldq  %xmm2, %xmm1, %xmm1
> vpunpckldq  %xmm3, %xmm0, %xmm0
> vpunpcklqdq %xmm1, %xmm0, %xmm0
> ---
> 
> it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so cost
> should be 92(23 * 4).

So the important part for any target is that it makes the scalar and
vector costs apples to apples because they end up being compared
against each other.  For loops the most important metric tends to be
latency, which is also the only thing that can be reasonably costed
when looking at a single statement at a time.  For all other factors
coming in there's (in theory) the finish_cost hook where, after
gathering individual stmt data from add_stmt_cost, a target hook can
apply adjustments based on, say, functional unit allocation (IIRC
the powerpc backend checks whether there are "many" shifts and
disparages vectorization in that case).

For the vector construction the x86 backend does a reasonable job
in costing - the only thing that's not very well modeled is the
extra cost of constructing from values in GPRs compared to
values in XMM regs (on some CPU archs that even has extra penalties).
But as seen above, "GPR" values can also come from memory, where
the difference vanishes (for AVX, not for SSE).
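
To make the finish_cost idea above concrete, here is a hypothetical hook in
the shape GCC 11 uses (the signature matches targetm.vectorize.finish_cost of
that era; the bookkeeping struct and the heuristic itself are invented for
illustration, not taken from any backend):

---
/* Hypothetical per-vectorization bookkeeping, filled in by the
   target's add_stmt_cost hook as statements are costed.  */
struct example_cost_data
{
  unsigned prologue, body, epilogue;   /* accumulated costs */
  unsigned nconstructs;                /* vec_construct stmts seen */
};

/* Hypothetical finish_cost: after per-stmt data was gathered, apply
   a whole-body adjustment, e.g. disparage bodies dominated by vector
   construction.  */
static void
example_finish_cost (void *data, unsigned *prologue_cost,
                     unsigned *body_cost, unsigned *epilogue_cost)
{
  example_cost_data *d = (example_cost_data *) data;
  unsigned body = d->body;
  if (d->nconstructs > 2)      /* made-up threshold */
    body += d->nconstructs * 4;
  *prologue_cost = d->prologue;
  *body_cost = body;
  *epilogue_cost = d->epilogue;
}
---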

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #29 from Kewen Lin  ---
(In reply to Hongtao.liu from comment #28)
> > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> 
> Yes, it's the place.
> 
> > is UB to UH conversion statement, further check if the def of the input UB
> > is MEM.
> 
> Only if there's no multi-use for UB. More generally, it's quite difficult to
> guess later optimizations for the purpose of more accurate vectorization
> cost model, :(.

Yeah, it's hard sadly. The generic cost modeling is rough; ix86_add_stmt_cost
is more fine-grained (at least than what we have on Power :)). If you want to
check it more, it seems doable in the target-specific hook finish_cost where
you can get the whole vinfo object, but it could end up with very heavy
analysis and might not be worth it.

Do you mind checking whether running FRE and DSE just after cunroll can also
fix this degradation on x86? It worked for Power, so I hoped it could help
there too.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #28 from Hongtao.liu  ---

> Probably you can try to tweak it in ix86_add_stmt_cost? when the statement

Yes, it's the place.

> is UB to UH conversion statement, further check if the def of the input UB
> is MEM.

Only if there's no multi-use of UB. More generally, it's quite difficult to
guess later optimizations for the purpose of a more accurate vectorization
cost model. :(

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #27 from Kewen Lin  ---
(In reply to Hongtao.liu from comment #22)
> > One of my workmates found that if we disable vectorization for SPEC2017
> > 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> > explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> > it can speed up by 4%.
> 
> For CLX, if we disable slp vectorization in sub4x4_dct by 
> __attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by
> 4%.
> 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do
> > the bb_info slp_instances revisits in finish_cost.
> 
> For i386 backend unsigned char --> unsigned short is not "no instruction",

Thanks for the information, it means it's target specific.

> but in this case
> ---
> 1033  _134 = MEM[(pixel *)pix1_295 + 2B];   
> 
> 1034  _135 = (short unsigned int) _134;
> ---
> 
> It could be combined and optimized to 
> ---
> movzbl  19(%rcx), %r8d
> ---
> 
> So, if the "unsigned char" variable is loaded from memory, then the
> conversion would also be "no instruction"; I'm not sure if the backend cost
> model could handle such a situation.

Probably you can try to tweak it in ix86_add_stmt_cost? When the statement is
a UB to UH conversion statement, further check whether the def of the input UB
is a MEM.
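
A rough sketch of that check as it might sit inside ix86_add_stmt_cost
(hedged: the predicates used here exist in GCC, but the placement, the
omitted type checks, and zeroing the cost are assumptions, not the eventual
fix):

---
/* If the scalar stmt is a widening cast whose source is a single-use
   memory load, the cast will fold into the load (e.g. movzbl), so
   don't charge for it.  The UB -> UH type checks are omitted here.  */
if (kind == scalar_stmt
    && stmt_info
    && gimple_assign_cast_p (stmt_info->stmt))
  {
    tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
    if (TREE_CODE (rhs) == SSA_NAME
        && has_single_use (rhs)
        && gimple_assign_load_p (SSA_NAME_DEF_STMT (rhs)))
      stmt_cost = 0;   /* conversion subsumed by the load */
  }
---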

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #26 from Kewen Lin  ---
> > By following this idea, to release the restriction on loop_outer
> > (loop_father) when setting the father_bbs, I can see FRE works as
> > expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.
> 
> Yeah, that's the reason we do not do it.  We could possibly restrict it
> to a containing loop, or if the containing loop is the whole function,
> restrict it to the original preheader block to the loop exits (which are
> no longer there, we'd need to pre-record those I think)

Thanks for the suggestion! 

I tried the idea of restricting it to run from the original preheader block to
the loop exits (pre-recording both as you said), but that can't get array d
eliminated in the end; unfortunately this case requires VN to run across the
boundary between the original loops.  Now I ended up running whole-function VN
once if there aren't any loops left after unrolling. I guess if there are no
loops, the CFG should be simple most of the time and then not so costly?

> > Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> > for scalar codes it needs one more time dse run after cunroll to get array d
> > eliminated. I guess it's not costly? Can one pass be run or not controlled
> > by something in another pass? via global variable and add one parameter in
> > passes.def seems weird. If it's costly, probably we can go by factoring out
> > one routine to be called instead of running a pass, like do_rpo_vn?
> 
> No, we don't have a good way to schedule passes from other passes.  And yes,
> the way forward is to support key transforms on regions.  Oh, and every
> pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.
> 

OK, I'll have a look at DSE and try to get it to support a region style,
although it may not help this case since it needs to operate across the loop
boundary.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #25 from Kewen Lin  ---
> > 
> > Got it! For 
> > 
> >   else if (vect_nop_conversion_p (stmt_info))
> > continue;
> > 
> > Is it a good idea to change it to call record_stmt_cost like the others? 
> >   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
> >   2) builtin_vectorization_cost return zero for it by default as before.
> >   3) targets can adjust the cost according to its need
> 
> I think this early-out was added for the case where there was no cost but
> the target costed it.  So at least go back and look what target that was
> and see if it can be adjusted.

OK, thanks. The change is from commit r10-4592. I think most of the objects it
handles have no cost on all targets, like VIEW_CONVERT_EXPR, UL/SL
bi-directional conversions etc., so it's good to differentiate them from
scalar_stmt, but the "continue" way makes it hard for targets to tweak the
cost of some tree_nop_conversion_p stmts.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #24 from rguenther at suse dot de  ---
On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
>
>--- Comment #22 from Hongtao.liu  ---
>> One of my workmates found that if we disable vectorization for SPEC2017
>> 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
>> explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
>> it can speed up by 4%.
>
>For CLX, if we disable slp vectorization in sub4x4_dct by 
>__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
>up by 4%.
>
>> Thanks Richi! Should we take care of this case? or neglect this kind of
>> extension as "no instruction"? I was intent to handle it in target specific
>> code, but it isn't recorded into cost vector while it seems too heavy to do
>> the bb_info slp_instances revisits in finish_cost.
>
>For i386 backend unsigned char --> unsigned short is not "no instruction",
>but in this case
>---
>1033  _134 = MEM[(pixel *)pix1_295 + 2B];  
>
>1034  _135 = (short unsigned int) _134;
>---
>
>It could be combined and optimized to 
>---
>movzbl  19(%rcx), %r8d
>---
>
>So, if the "unsigned char" variable is loaded from memory, then the
>conversion would also be "no instruction"; I'm not sure if the backend
>cost model could handle such a situation.

I think all attempts to address this from the side of the scalar cost are
going to be difficult and fragile.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #23 from Hongtao.liu  ---
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 

The cost of vec_construct in i386 backend is 64, calculated as 16 x 4

cut from i386.c
---
/* N element inserts into SSE vectors.  */ 
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
---

From the perspective of pipeline latency it seems ok, but from the perspective
of rtx_cost it seems inaccurate, since it would be initialized as
---
vmovd   %eax, %xmm0
vpinsrb $1, 1(%rsi), %xmm0, %xmm0
vmovd   %eax, %xmm7
vpinsrb $1, 3(%rsi), %xmm7, %xmm7
vmovd   %eax, %xmm3
vpinsrb $1, 17(%rsi), %xmm3, %xmm3
vmovd   %eax, %xmm6
vpinsrb $1, 19(%rsi), %xmm6, %xmm6
vmovd   %eax, %xmm1
vpinsrb $1, 33(%rsi), %xmm1, %xmm1
vmovd   %eax, %xmm5
vpinsrb $1, 35(%rsi), %xmm5, %xmm5
vmovd   %eax, %xmm2
vpinsrb $1, 49(%rsi), %xmm2, %xmm2
vmovd   %eax, %xmm4
vpinsrb $1, 51(%rsi), %xmm4, %xmm4
vpunpcklwd  %xmm6, %xmm3, %xmm3
vpunpcklwd  %xmm4, %xmm2, %xmm2
vpunpcklwd  %xmm7, %xmm0, %xmm0
vpunpcklwd  %xmm5, %xmm1, %xmm1
vpunpckldq  %xmm2, %xmm1, %xmm1
vpunpckldq  %xmm3, %xmm0, %xmm0
vpunpcklqdq %xmm1, %xmm0, %xmm0
---

it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so the cost
should be 92 (23 * 4).
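
A sketch of the adjusted formula implied by that count (illustrative only,
not a committed change): N element inserts plus the N/2 - 1 punpck merges
needed to combine the partial vectors, instead of N inserts alone.

---
/* N inserts (vmovd + vpinsrb pairs) plus N/2 - 1 punpck merges:
   for N = 16 that is (16 + 7) * sse_op = 23 * 4 = 92, versus the
   current 16 * 4 = 64.  */
int n = TYPE_VECTOR_SUBPARTS (vectype);
int cost = (n + n / 2 - 1) * ix86_cost->sse_op;
---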

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #22 from Hongtao.liu  ---
> One of my workmates found that if we disable vectorization for SPEC2017
> 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> it can speed up by 4%.

For CLX, if we disable slp vectorization in sub4x4_dct by 
__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by 4%.

> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do
> the bb_info slp_instances revisits in finish_cost.

For the i386 backend unsigned char --> unsigned short is not "no instruction",
but in this case
---
1033  _134 = MEM[(pixel *)pix1_295 + 2B];   
1034  _135 = (short unsigned int) _134;
---

It could be combined and optimized to 
---
movzbl  19(%rcx), %r8d
---

So, if the "unsigned char" variable is loaded from memory, then the conversion
would also be "no instruction"; I'm not sure if the backend cost model could
handle such a situation.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #21 from Richard Biener  ---
(In reply to Kewen Lin from comment #18)
> (In reply to Richard Biener from comment #10)
> > (In reply to Kewen Lin from comment #9)
> > > (In reply to Richard Biener from comment #8)
> > > > (In reply to Kewen Lin from comment #7)
> > > > > Two questions in mind, need to dig into it further:
> > > > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > > > needed
> > > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > > consider the stores.
> > > > 
> > > > Because when modeling they are still there.  There's no good way around 
> > > > this.
> > > > 
> > > 
> > > I noticed the stores get eliminated during FRE.  Can we consider running 
> > > FRE
> > > once just before SLP? a bad idea due to compilation time?
> > 
> > Yeah, we already run FRE a lot and it is one of the more expensive passes.
> > 
> > Note there's one point we could do better which is the embedded SESE FRE
> > run from cunroll which is only run before we consider peeling an outer loop
> > and thus not for the outermost unrolled/peeled code (but the question would
> > be from where / up to what to apply FRE to).  On x86_64 this would apply to
> > the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> > quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> > matters for costing though).
> 
> By following this idea, to release the restriction on loop_outer
> (loop_father) when setting the father_bbs, I can see FRE works as
> expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.

Yeah, that's the reason we do not do it.  We could possibly restrict it
to a containing loop, or if the containing loop is the whole function,
restrict it to the original preheader block to the loop exits (which are
no longer there, we'd need to pre-record those I think)

> If it's taken as costly, we probably can guard it to take effects only when
> all its inner loops are unrolled, for this case, all of its three loops are
> unrolled.

The original reason VN is done on unrolled bodies is to improve cost estimates
when unrolling nests - since we do not unroll the non-existing outer loop of
the outermost loop, applying VN there for this reason is pointless, and the
reasoning is to simply wait for the next scheduled VN pass (which is too late
for SLP).

> Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> for scalar codes it needs one more time dse run after cunroll to get array d
> eliminated. I guess it's not costly? Can one pass be run or not controlled
> by something in another pass? via global variable and add one parameter in
> passes.def seems weird. If it's costly, probably we can go by factoring out
> one routine to be called instead of running a pass, like do_rpo_vn?

No, we don't have a good way to schedule passes from other passes.  And yes,
the way forward is to support key transforms on regions.  Oh, and every
pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.

What we could eventually do is move the non-loop SLP pass later and
run the loop SLP pass only on loop bodies (so overall every BB is just
SLP vectorized once).  The

  PUSH_INSERT_PASSES_WITHIN (pass_tree_no_loop)
  NEXT_PASS (pass_slp_vectorize);
  POP_INSERT_PASSES ()

pass would then be unconditional and only run on non-loop BBs.  Note
that SLP still leaves us with FRE opportunities so this will probably
not solve the issue fully.  Pass reorderings are also always tricky...
(adding passes is easy but piling these up over time is bad).

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #20 from Richard Biener  ---
(In reply to Kewen Lin from comment #19)
> (In reply to rguent...@suse.de from comment #17)
> > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > 
> > > --- Comment #15 from Kewen Lin  ---
> > > (In reply to rguent...@suse.de from comment #14)
> > > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > > 
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > > 
> > > > > --- Comment #13 from Kewen Lin  ---
> > > > > >   2) on Power, the conversion from unsigned char to unsigned short 
> > > > > > is nop
> > > > > > conversion, when we counting scalar cost, it's counted, then add 
> > > > > > costs 32
> > > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned 
> > > > > > short to
> > > > > > signed short should be counted but it's not (need to check why 
> > > > > > further). 
> > > > > 
> > > > > UH to SH conversion is true when calling vect_nop_conversion_p, so 
> > > > > it's not
> > > > > even put into the cost vector. 
> > > > > 
> > > > > tree_nop_conversion_p's comments saying:
> > > > > 
> > > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > > >no instruction.  */
> > > > > 
> > > > > I may miss something here, but UH to SH conversion does need one 
> > > > > explicit
> > > > > extend instruction *extsh*, the precision/mode equality check looks 
> > > > > wrong for
> > > > > this conversion.
> > > > 
> > > > Well, it isn't a RTL predicate and it only needs extension because
> > > > there's never a HImode pseudo but always SImode subregs.
> > > 
> > > Thanks Richi! Should we take care of this case? or neglect this kind of
> > > extension as "no instruction"? I was intent to handle it in target 
> > > specific
> > > code, but it isn't recorded into cost vector while it seems too heavy to 
> > > do the
> > > bb_info slp_instances revisits in finish_cost.
> > 
> > I think it's not something we should handle on GIMPLE.
> 
> Got it! For 
> 
> else if (vect_nop_conversion_p (stmt_info))
>   continue;
> 
> Is it a good idea to change it to call record_stmt_cost like the others? 
>   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
>   2) builtin_vectorization_cost return zero for it by default as before.
>   3) targets can adjust the cost according to its need

I think this early-out was added for the case where there was no cost but
the target costed it.  So at least go back and look at what target that was
and see if it can be adjusted.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-25 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #19 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #17)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #15 from Kewen Lin  ---
> > (In reply to rguent...@suse.de from comment #14)
> > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > 
> > > > --- Comment #13 from Kewen Lin  ---
> > > > >   2) on Power, the conversion from unsigned char to unsigned short is 
> > > > > nop
> > > > > conversion, when we counting scalar cost, it's counted, then add 
> > > > > costs 32
> > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned 
> > > > > short to
> > > > > signed short should be counted but it's not (need to check why 
> > > > > further). 
> > > > 
> > > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's 
> > > > not
> > > > even put into the cost vector. 
> > > > 
> > > > tree_nop_conversion_p's comments saying:
> > > > 
> > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > >no instruction.  */
> > > > 
> > > > I may miss something here, but UH to SH conversion does need one 
> > > > explicit
> > > > extend instruction *extsh*, the precision/mode equality check looks 
> > > > wrong for
> > > > this conversion.
> > > 
> > > Well, it isn't a RTL predicate and it only needs extension because
> > > there's never a HImode pseudo but always SImode subregs.
> > 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do 
> > the
> > bb_info slp_instances revisits in finish_cost.
> 
> I think it's not something we should handle on GIMPLE.

Got it! For 

  else if (vect_nop_conversion_p (stmt_info))
continue;

Is it a good idea to change it to call record_stmt_cost like the others?
  1) introduce a new vect_cost_for_stmt enum value, e.g. nop_stmt
  2) builtin_vectorization_cost returns zero for it by default, as before
  3) targets can adjust the cost according to their needs
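
A sketch of what that could look like in the scalar-cost walk (hedged:
nop_stmt is the hypothetical new enum value from step 1, and the surrounding
code is paraphrased from tree-vect-slp.c, not a committed patch):

---
/* Instead of the current "continue" early-out ...  */
else if (vect_nop_conversion_p (stmt_info))
  /* ... record the stmt with a dedicated kind, so the default
     builtin_vectorization_cost can return 0 for it while targets
     may override that.  */
  kind = nop_stmt;
else
  kind = scalar_stmt;
record_stmt_cost (cost_vec, 1, kind, stmt_info, 0, vect_body);
---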

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-25 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #18 from Kewen Lin  ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > > needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around 
> > > this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

By following this idea, releasing the restriction on loop_outer (loop_father)
when setting the father_bbs, I can see FRE works as expected.  But it
actually does the rpo_vn from cfun's entry to its exit. If that's taken as
costly, we can probably guard it to take effect only when all the inner loops
are unrolled; for this case, all of its three loops are unrolled.
Besides, when SLP happens, FRE generates the bit_field_refs and removes array
d, but for the scalar code it needs one more DSE run after cunroll to get
array d eliminated. I guess that's not costly? Can whether one pass runs be
controlled by something in another pass? Doing it via a global variable plus
one more parameter in passes.def seems weird. If it's costly, we can probably
go by factoring out one routine to be called instead of running a pass, like
do_rpo_vn?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-21 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #17 from rguenther at suse dot de  ---
On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> 
> --- Comment #15 from Kewen Lin  ---
> (In reply to rguent...@suse.de from comment #14)
> > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > 
> > > --- Comment #13 from Kewen Lin  ---
> > > >   2) on Power, the conversion from unsigned char to unsigned short is 
> > > > nop
> > > > conversion, when we counting scalar cost, it's counted, then add costs 
> > > > 32
> > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short 
> > > > to
> > > > signed short should be counted but it's not (need to check why 
> > > > further). 
> > > 
> > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's 
> > > not
> > > even put into the cost vector. 
> > > 
> > > tree_nop_conversion_p's comments saying:
> > > 
> > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > >no instruction.  */
> > > 
> > > I may miss something here, but UH to SH conversion does need one explicit
> > > extend instruction *extsh*, the precision/mode equality check looks wrong 
> > > for
> > > this conversion.
> > 
> > Well, it isn't a RTL predicate and it only needs extension because
> > there's never a HImode pseudo but always SImode subregs.
> 
> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do 
> the
> bb_info slp_instances revisits in finish_cost.

I think it's not something we should handle on GIMPLE.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #16 from Hongtao.liu  ---
I notice

0x5561dc0 _36 * 2 1 times scalar_stmt costs 16 in body
0x5561dc0 _38 * 2 1 times scalar_stmt costs 16 in body
0x5562df0 _36 * 2 1 times vector_stmt costs 16 in body
0x5562df0 _38 * 2 1 times vector_stmt costs 16 in body

ix86_multiplication_cost would be called for cost estimation, but in
pass_expand, synth_mult will transform the multiplication into a shift. So the
shift cost should be used in this case, not the mult cost.
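
Sketched as a cost-hook tweak (hedged: ix86_multiplication_cost and
integer_pow2p are real, but shift_cost () is a placeholder helper name and
the placement is assumed):

---
/* synth_mult will turn a multiply by a power-of-two constant into
   a shift at expand time, so cost it as a shift here.  */
tree op1 = gimple_assign_rhs2 (stmt_info->stmt);
if (op1 && TREE_CODE (op1) == INTEGER_CST && integer_pow2p (op1))
  stmt_cost = shift_cost (mode);   /* placeholder helper */
else
  stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
---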

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-18 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #15 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #14)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #13 from Kewen Lin  ---
> > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > signed short should be counted but it's not (need to check why further). 
> > 
> > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > even put into the cost vector. 
> > 
> > tree_nop_conversion_p's comments saying:
> > 
> > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> >no instruction.  */
> > 
> > I may miss something here, but UH to SH conversion does need one explicit
> > extend instruction *extsh*, the precision/mode equality check looks wrong 
> > for
> > this conversion.
> 
> Well, it isn't a RTL predicate and it only needs extension because
> there's never a HImode pseudo but always SImode subregs.

Thanks Richi! Should we take care of this case, or neglect this kind of
extension as "no instruction"? I intended to handle it in target-specific
code, but it isn't recorded into the cost vector, and it seems too heavy to
revisit the bb_info slp_instances in finish_cost.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-18 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #14 from rguenther at suse dot de  ---
On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> 
> --- Comment #13 from Kewen Lin  ---
> >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > conversion, when we counting scalar cost, it's counted, then add costs 32
> > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > signed short should be counted but it's not (need to check why further). 
> 
> UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> even put into the cost vector. 
> 
> tree_nop_conversion_p's comments saying:
> 
> /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
>no instruction.  */
> 
> I may miss something here, but UH to SH conversion does need one explicit
> extend instruction *extsh*, the precision/mode equality check looks wrong for
> this conversion.

Well, it isn't an RTL predicate, and it only needs an extension because
there's never a HImode pseudo but always SImode subregs.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-18 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #13 from Kewen Lin  ---
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 

vect_nop_conversion_p returns true for the UH to SH conversion, so it's not
even put into the cost vector.

tree_nop_conversion_p's comments saying:

/* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
   no instruction.  */

I may be missing something here, but the UH to SH conversion does need one
explicit extend instruction *extsh*; the precision/mode equality check looks
wrong for this conversion.
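
For illustration, a minimal C example of the conversion in question (an
added sketch; the extsh behavior is as described above):

---
/* tree_nop_conversion_p calls this cast "no instruction" (same
   16-bit precision, same HImode), yet Power must sign-extend the
   GPR with extsh before the value feeds wider signed arithmetic.  */
int
use_signed (unsigned short x)
{
  short s = (short) x;   /* "free" in gimple, extsh on Power */
  return s + 1;
}
---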

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #12 from Kewen Lin  ---

> Thanks for the explanation! I'll look at it after checking 2). IIUC, the
> advantage to eliminate stores here looks able to get those things which is
> fed to stores and stores' consumers bundled, then get more things SLP-ed if
> available?

Hmm, I think I was wrong: if both the feeding chain and the consuming chain of
the stores are SLP-ed, later FRE would be able to fuse them.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #11 from Kewen Lin  ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > > needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around 
> > > this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Thanks for the explanation! I'll look at it after checking 2). IIUC, the
advantage of eliminating the stores here is that the things feeding the stores
and the stores' consumers can get bundled together, so more things get SLP-ed
if available?

> 
> I think I looked at this or a related testcase some time ago and split out
> some PRs (can't find those right now).  For example we are not considering
> to simplify
> 

> 
> the load permutations suggest that splitting the group into 4-lane pieces
> would avoid doing permutes but then that would require target support
> for V4QI and V4HI vectors.  At least the loads could be considered
> to be vectorized with strided-SLP, yielding 'int' loads and a vector
> build from 4 ints.  I'd need to analyze why we do not consider this.

Good idea! Curious: is there some port where an int load cannot work well on a
1-byte-aligned address, e.g. traps?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #10 from Richard Biener  ---
(In reply to Kewen Lin from comment #9)
> (In reply to Richard Biener from comment #8)
> > (In reply to Kewen Lin from comment #7)
> > > Two questions in mind, need to dig into it further:
> > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > needed
> > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > consider the stores.
> > 
> > Because when modeling they are still there.  There's no good way around 
> > this.
> > 
> 
> I noticed the stores get eliminated during FRE.  Can we consider running FRE
> once just before SLP? a bad idea due to compilation time?

Yeah, we already run FRE a lot and it is one of the more expensive passes.

Note there's one point we could do better which is the embedded SESE FRE
run from cunroll which is only run before we consider peeling an outer loop
and thus not for the outermost unrolled/peeled code (but the question would
be from where / up to what to apply FRE to).  On x86_64 this would apply to
the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
matters for costing though).

I think I looked at this or a related testcase some time ago and split out
some PRs (can't find those right now).  For example we are not considering
to simplify

  _318 = {_4, _14, _293, _30, _49, _251, _225, _248, _52, _70, _260, _284,
_100, _117, _134, _151};
  vect__5.47_319 = (vector(16) short unsigned int) _318;
  _154 = MEM[(pixel *)pix2_58(D) + 99B];
  _320 = {_6, _16, _22, _32, _51, _255, _231, _243, _54, _68, _276, _286, _103,
_120, _137, _154};
  vect__7.48_321 = (vector(16) short unsigned int) _320;
  vect__12.49_322 = vect__5.47_319 - vect__7.48_321;
  _317 = BIT_FIELD_REF ;
  _315 = BIT_FIELD_REF ;
  _313 = BIT_FIELD_REF ;
  _311 = BIT_FIELD_REF ;
  vect_perm_even_165 = VEC_PERM_EXPR <_317, _315, { 0, 2, 4, 6 }>;
  vect_perm_odd_164 = VEC_PERM_EXPR <_317, _315, { 1, 3, 5, 7 }>;
  vect_perm_even_163 = VEC_PERM_EXPR <_313, _311, { 0, 2, 4, 6 }>;
  vect_perm_odd_156 = VEC_PERM_EXPR <_313, _311, { 1, 3, 5, 7 }>;

down to smaller vectors.  Also apparently the two vector CTORs are not
re-shuffled to vector load + shuffle.  In the SLP analysis we end up with

t2.c:12:32: note:   Final SLP tree for instance:
t2.c:12:32: note:   node 0x436e3c0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 *_11 = _12;
t2.c:12:32: note:   stmt 1 *_21 = _71;
...
t2.c:12:32: note:   stmt 15 *_160 = _161;
t2.c:12:32: note:   children 0x436de70
t2.c:12:32: note:   node 0x436de70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _12 = _5 - _7;
t2.c:12:32: note:   stmt 1 _71 = _15 - _17;
...
.c:12:32: note:   stmt 15 _161 = _152 - _155;
t2.c:12:32: note:   children 0x436ebb0 0x4360b70
t2.c:12:32: note:   node 0x436ebb0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _5 = (short unsigned int) _4;
...
t2.c:12:32: note:   stmt 15 _152 = (short unsigned int) _151;
t2.c:12:32: note:   children 0x42f1740
t2.c:12:32: note:   node 0x42f1740 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _4 = *pix1_57(D);
...
t2.c:12:32: note:   stmt 15 _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:12:32: note:   load permutation { 0 1 2 3 16 17 18 19 32 33 34 35 48
49 50 51 }
t2.c:12:32: note:   node 0x4360b70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _7 = (short unsigned int) _6;
...
t2.c:12:32: note:   stmt 15 _155 = (short unsigned int) _154;
t2.c:12:32: note:   children 0x4360be0
t2.c:12:32: note:   node 0x4360be0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _6 = *pix2_58(D);
...
t2.c:12:32: note:   stmt 15 _154 = MEM[(pixel *)pix2_296 + 3B];
t2.c:12:32: note:   load permutation { 0 1 2 3 32 33 34 35 64 65 66 67 96
97 98 99 }

the load permutations suggest that splitting the group into 4-lane pieces
would avoid doing permutes but then that would require target support
for V4QI and V4HI vectors.  At least the loads could be considered
to be vectorized with strided-SLP, yielding 'int' loads and a vector
build from 4 ints.  I'd need to analyze why we do not consider this.

t2.c:50:1: note:   Detected interleaving load of size 52
t2.c:50:1: note:_4 = *pix1_57(D);
t2.c:50:1: note:_14 = MEM[(pixel *)pix1_57(D) + 1B];
t2.c:50:1: note:_293 = MEM[(pixel *)pix1_57(D) + 2B];
t2.c:50:1: note:_30 = MEM[(pixel *)pix1_57(D) + 3B];
t2.c:50:1: note:
t2.c:50:1: note:_49 = *pix1_40;
t2.c:50:1: note:_251 = MEM[(pixel *)pix1_40 + 1B];
t2.c:50:1: note:_225 = MEM[(pixel *)pix1_40 + 2B];
t2.c:50:1: note:_248 = MEM[(pixel *)pix1_40 + 3B];
t2.c:50:1: note:
t2.c:50:1: note:_52 = *pix1_264;
t2.c:50:1: note:_70 = MEM[(pixel *)pix1_264 + 1B];
t2.c:50:1: note:_260 = MEM[(pixel *)pix1_264 + 2B];
t2.c:50:1: note:  

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #9 from Kewen Lin  ---
(In reply to Richard Biener from comment #8)
> (In reply to Kewen Lin from comment #7)
> > Two questions in mind, need to dig into it further:
> >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > consider the stores.
> 
> Because when modeling they are still there.  There's no good way around this.
> 

I noticed the stores get eliminated during FRE.  Can we consider running FRE
once just before SLP? Or is that a bad idea due to compilation time?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #8 from Richard Biener  ---
(In reply to Kewen Lin from comment #7)
> Two questions in mind, need to dig into it further:
>   1) from the assembly of scalar/vector code, I don't see any stores needed
> into temp array d (array diff in pixel_sub_wxh), but when modeling we
> consider the stores.

Because when modeling they are still there.  There's no good way around this.

> On Power two vector stores take cost 2 while 16 scalar
> stores takes cost 16, it seems wrong to cost model something useless. Later,
> for the vector version we need 16 vector halfword extractions from these two
> halfword vectors, while scalar version the values are just in GPR register,
> vector version looks inefficient.
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 
> The nop conversion costing looks something we can handle in function
> rs6000_adjust_vect_cost_per_stmt, I tried to use the generic function
> tree_nop_conversion_p, but it's only for same mode/precision conversion.
> Will find/check something else.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2020-09-16
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #7 from Kewen Lin  ---
Two questions in mind, need to dig into it further:
  1) from the assembly of the scalar/vector code, I don't see any stores
needed into the temp array d (array diff in pixel_sub_wxh), but when modeling
we consider the stores. On Power two vector stores take cost 2 while 16 scalar
stores take cost 16; it seems wrong to cost-model something useless. Later,
the vector version needs 16 vector halfword extractions from these two
halfword vectors, while in the scalar version the values are just in GPR
registers, so the vector version looks inefficient.
  2) on Power, the conversion from unsigned char to unsigned short is a nop
conversion; when counting the scalar cost it's counted, adding costs of 32 in
total onto the scalar cost. Meanwhile, the conversion from unsigned short to
signed short should be counted but it's not (need to check why further).  The
nop conversion costing looks like something we can handle in
rs6000_adjust_vect_cost_per_stmt; I tried to use the generic function
tree_nop_conversion_p, but it's only for same mode/precision conversions. Will
find/check something else.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-30 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

--- Comment #6 from Kewen Lin  ---
(In reply to Richard Biener from comment #4)
> This delays some checks to eventually support part of the BB vectorization
> which is what succeeds here.  I suspect that w/o vectorization we manage
> to elide the tmp[] array but with the part vectorization that occurs we
> fail to do that.
> 
> On the cost side there would be a lot needed to make the vectorization
> not profitable:
> 
>   Vector inside of basic block cost: 8
>   Vector prologue cost: 36
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 64
> 
> the thing to double-check is
> 
> 0x123b1ff0  1 times vec_construct costs 17 in prologue
> 
> that is the cost of the V16QI construct
> 
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 
> 

Thanks Richard!  I did some cost adjustment experiments last year and the cost
for v16qi does look off indeed, but at that time the SPEC performance didn't
change with the cost tweaked; I guessed we just happened not to have this kind
of case to run into. I'll have a look and re-evaluate it for this.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #5 from Richard Biener  ---
testcase from https://github.com/mirror/x264/blob/master/common/dct.c

where FENC_STRIDE is 16 and FDEC_STRIDE 32

pixel is unsigned char, dctcoef is unsigned short

static inline void pixel_sub_wxh( dctcoef *diff, int i_size,
  pixel *pix1, int i_pix1, pixel *pix2, int
i_pix2 )
{
for( int y = 0; y < i_size; y++ )
{
for( int x = 0; x < i_size; x++ )
diff[x + y*i_size] = pix1[x] - pix2[x];
pix1 += i_pix1;
pix2 += i_pix2;
}
}

static void sub4x4_dct( dctcoef dct[16], pixel *pix1, pixel *pix2 )
{
dctcoef d[16];
dctcoef tmp[16];

pixel_sub_wxh( d, 4, pix1, FENC_STRIDE, pix2, FDEC_STRIDE );

for( int i = 0; i < 4; i++ )
{
int s03 = d[i*4+0] + d[i*4+3];
int s12 = d[i*4+1] + d[i*4+2];
int d03 = d[i*4+0] - d[i*4+3];
int d12 = d[i*4+1] - d[i*4+2];

tmp[0*4+i] =   s03 +   s12;
tmp[1*4+i] = 2*d03 +   d12;
tmp[2*4+i] =   s03 -   s12;
tmp[3*4+i] =   d03 - 2*d12;
}

for( int i = 0; i < 4; i++ )
{
int s03 = tmp[i*4+0] + tmp[i*4+3];
int s12 = tmp[i*4+1] + tmp[i*4+2];
int d03 = tmp[i*4+0] - tmp[i*4+3];
int d12 = tmp[i*4+1] - tmp[i*4+2];

dct[i*4+0] =   s03 +   s12;
dct[i*4+1] = 2*d03 +   d12;
dct[i*4+2] =   s03 -   s12;
dct[i*4+3] =   d03 - 2*d12;
}
}

one can see vectorization is complicated by the non-homogeneous computations.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #4 from Richard Biener  ---
This delays some checks to eventually support part of the BB vectorization
which is what succeeds here.  I suspect that w/o vectorization we manage
to elide the tmp[] array but with the part vectorization that occurs we
fail to do that.

On the cost side there would be a lot needed to make the vectorization
not profitable:

  Vector inside of basic block cost: 8
  Vector prologue cost: 36
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

the thing to double-check is

0x123b1ff0  1 times vec_construct costs 17 in prologue

that is the cost of the V16QI construct

 _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
_576, _125, _143, _161, _179}; 

maybe you can extract a testcase for the function as well?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-26 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #3 from Kewen Lin  ---
Bisection shows it started to fail from r11-205.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-26 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #2 from Kewen Lin  ---
Created attachment 49124
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49124&action=edit
sub4x4_dct SLP dumping

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Richard Biener  changed:

   What|Removed |Added

  Component|tree-optimization   |target
   Keywords||missed-optimization
 Target||powerpc
 Blocks||53947

--- Comment #1 from Richard Biener  ---
Can you attach the relevant SLP vectorizer dump part (just for the function in
question)?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations