Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-05-17 Thread Richard Biener
On Fri, 14 May 2021, Xionghu Luo wrote:

> Hi Richi,
> 
> On 2021/4/21 19:54, Richard Biener wrote:
> > On Tue, 20 Apr 2021, Xionghu Luo wrote:
> > 
> >>
> >>
> >> On 2021/4/15 19:34, Richard Biener wrote:
> >>> On Thu, 15 Apr 2021, Xionghu Luo wrote:
> >>>
>  Thanks,
> 
>  On 2021/4/14 14:41, Richard Biener wrote:
> >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
> >> but it moves #538 first, then #235; there is a strong dependency here.  It
> >> seems unlike the LCM framework, which could solve all of them and do the
> >> delete-insert in one iteration.
> > So my question was whether we want to do both within the LCM store
> > sinking framework.  The LCM dataflow is also used by RTL PRE which
> > handles both loads and non-loads so in principle it should be able
> > to handle stores and non-stores for the sinking case (PRE on the
> > reverse CFG).
> >
> > A global dataflow is more powerful than any local ad-hoc method.
> 
>  My biggest concern is whether the LCM DF framework could support sinking
>  *multiple* reverse-dependent non-store instructions together by *one*
>  call of LCM DF.  If this is not supported, we need to run LCM multiple
>  times until there are no new changes, which would obviously be time
>  consuming (unless compile time is not important here).
> >>>
> >>> As said it is used for PRE and there it most definitely can do that.
> >>
> >> I did some investigation about PRE and attached a case to show how it
> >> works; it is quite like store-motion, and actually there is an rtl-hoist
> >> pass in gcse.c which only works for code size.  All of them leverage
> >> the LCM framework to move instructions upward or downward.
> >>
> >> PRE and rtl-hoist move instructions upward: they analyze/hash the SOURCE
> >> exprs and call pre_edge_lcm.  Store-motion and rtl-sink move instructions
> >> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm.
> >> All four problems are converted to the LCM DF problem with
> >> n_basic_blocks * m_exprs bitmaps for the 4 matrices (antic, transp,
> >> avail, kill) as input and two outputs of where to insert/delete.
> >>
> >> PRE scans each instruction and hashes the SRC into the table without
> >> *checking the relationship between instructions*.  For the case attached,
> >> BB 37, BB 38 and BB 41 all contain the SOURCE expr "r262:DI+r139:DI",
> >> but BB 37 and BB 41 save it to index 106 while BB 38 saves it to index
> >> 110.  After finishing this pass, "r262:DI+r139:DI" in BB 41 is replaced
> >> with "r194:DI=r452:DI", then exprs are inserted into BB 75~BB 80 to
> >> create full redundancies from partial redundancies, and finally the
> >> instruction in BB 37 is updated.
> > 
> > I'm not familiar with the actual PRE code but reading the toplevel comment
> > it seems that indeed it can only handle expressions contained in a single
> > insn unless a REG_EQUAL note provides a short-hand for the larger one.
> > 
> > That of course means it would need to mark things as not transparent
> > for correctness where they'd be if moved together.  Now, nothing
> > prevents you changing the granularity of what you feed LCM.
> > 
> > So originally we arrived at looking into LCM because there's already
> > a (store) sinking pass on RTL (using LCM) so adding another (loop-special)
> > one didn't look like the best obvious solution.
> > 
> > That said, LCM would work for single-instruction expressions.
> > Alternatively a greedy algorithm like you prototyped could be used.
> > Another pass to look at would be RTL invariant motion which seems to
> > compute some kind of dependency graph - not sure if that would be
> > adaptable for the reverse CFG problem.
> > 
> 
> Actually my RTL sinking pass patch is borrowed from RTL loop invariant
> motion; it is quite limited since it only moves instructions from the loop
> header to the loop exits, though it could be refined with various
> algorithms.  Compared to the initial method of running the gimple sink
> pass once more, it seems much more complicated and limited without gaining
> any obvious performance benefit.  Shall we turn back to considering the
> gimple sink2 pass from the original proposal, since we are in stage1 now?

OK, so while there might be new sinking opportunities exposed during
RTL expansion and early RTL opts, we can consider adding another sink pass
on GIMPLE.  Since it's basically a scheduling optimization, placement
shouldn't matter much, but I suppose we should run it before store
merging, so anywhere between cd_dce and that.

Richard.


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-05-14 Thread Xionghu Luo via Gcc-patches
Hi Richi,

On 2021/4/21 19:54, Richard Biener wrote:
> On Tue, 20 Apr 2021, Xionghu Luo wrote:
> 
>>
>>
>> On 2021/4/15 19:34, Richard Biener wrote:
>>> On Thu, 15 Apr 2021, Xionghu Luo wrote:
>>>
 Thanks,

 On 2021/4/14 14:41, Richard Biener wrote:
>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
>> but it moves #538 first, then #235; there is a strong dependency here.  It
>> seems unlike the LCM framework, which could solve all of them and do the
>> delete-insert in one iteration.
> So my question was whether we want to do both within the LCM store
> sinking framework.  The LCM dataflow is also used by RTL PRE which
> handles both loads and non-loads so in principle it should be able
> to handle stores and non-stores for the sinking case (PRE on the
> reverse CFG).
>
> A global dataflow is more powerful than any local ad-hoc method.

 My biggest concern is whether the LCM DF framework could support sinking
 *multiple* reverse-dependent non-store instructions together by *one*
 call of LCM DF.  If this is not supported, we need to run LCM multiple
 times until there are no new changes, which would obviously be time
 consuming (unless compile time is not important here).
>>>
>>> As said it is used for PRE and there it most definitely can do that.
>>
>> I did some investigation about PRE and attached a case to show how it
>> works; it is quite like store-motion, and actually there is an rtl-hoist
>> pass in gcse.c which only works for code size.  All of them leverage
>> the LCM framework to move instructions upward or downward.
>>
>> PRE and rtl-hoist move instructions upward: they analyze/hash the SOURCE
>> exprs and call pre_edge_lcm.  Store-motion and rtl-sink move instructions
>> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm.
>> All four problems are converted to the LCM DF problem with
>> n_basic_blocks * m_exprs bitmaps for the 4 matrices (antic, transp,
>> avail, kill) as input and two outputs of where to insert/delete.
>>
>> PRE scans each instruction and hashes the SRC into the table without
>> *checking the relationship between instructions*.  For the case attached,
>> BB 37, BB 38 and BB 41 all contain the SOURCE expr "r262:DI+r139:DI",
>> but BB 37 and BB 41 save it to index 106 while BB 38 saves it to index
>> 110.  After finishing this pass, "r262:DI+r139:DI" in BB 41 is replaced
>> with "r194:DI=r452:DI", then exprs are inserted into BB 75~BB 80 to
>> create full redundancies from partial redundancies, and finally the
>> instruction in BB 37 is updated.
> 
> I'm not familiar with the actual PRE code but reading the toplevel comment
> it seems that indeed it can only handle expressions contained in a single
> insn unless a REG_EQUAL note provides a short-hand for the larger one.
> 
> That of course means it would need to mark things as not transparent
> for correctness where they'd be if moved together.  Now, nothing
> prevents you changing the granularity of what you feed LCM.
> 
> So originally we arrived at looking into LCM because there's already
> a (store) sinking pass on RTL (using LCM) so adding another (loop-special)
> one didn't look like the best obvious solution.
> 
> That said, LCM would work for single-instruction expressions.
> Alternatively a greedy algorithm like you prototyped could be used.
> Another pass to look at would be RTL invariant motion which seems to
> compute some kind of dependency graph - not sure if that would be
> adaptable for the reverse CFG problem.
> 

Actually my RTL sinking pass patch is borrowed from RTL loop invariant
motion; it is quite limited since it only moves instructions from the loop
header to the loop exits, though it could be refined with various
algorithms.  Compared to the initial method of running the gimple sink
pass once more, it seems much more complicated and limited without gaining
any obvious performance benefit.  Shall we turn back to considering the
gimple sink2 pass from the original proposal, since we are in stage1 now?


Thanks,
Xionghu


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-04-23 Thread Jeff Law via Gcc-patches



On 4/14/2021 12:41 AM, Richard Biener wrote:

On Wed, 14 Apr 2021, Xionghu Luo wrote:


Hi,

On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote:

Also we already have a sinking pass on RTL which even computes
a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c.
I'm not sure whether this deals with non-stores but the
LCM machinery definitely can handle arbitrary expressions.  I wonder
if it makes more sense to extend this rather than inventing a new
ad-hoc sinking pass?

  From its description, my pass doesn't handle or process store instructions
like store-motion does.  Thanks, will check it.

Store motion only processes store instructions with data flow equations,
generating 4 inputs (st_kill, st_avloc, st_antloc, st_transp) and solving
them with the Lazy Code Motion API (5 DF compute calls) and 2 outputs
(st_delete_map, st_insert_map) globally; each store place is independently
represented in the input bitmap vectors.  The output says which stores
should be deleted and where to insert.  The current code does what you
said, "emit copies to a new pseudo at the original insn location and use
it in the following bb"; actually it is "store replacement" instead of
"store move", so why not save one pseudo by moving the store instruction
to the target edge directly?

It probably simply saves the pass from doing analysis whether the
stored value is clobbered on the sinking path, enabling more store
sinking.  For stores that might be even beneficial, for non-stores
it becomes more of a cost issue, yes.


There are many differences between the newly added rtl-sink pass and the
store-motion pass.
1. Store motion moves only store instructions; rtl-sink ignores store
instructions.
2. Store motion is global DF problem solving; rtl-sink only processes the
loop header in reverse with a dependency check in the loop.  Take the RTL
below as an example:
"#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
but it moves #538 first, then #235; there is a strong dependency here.  It
seems unlike the LCM framework, which could solve all of them and do the
delete-insert in one iteration.

So my question was whether we want to do both within the LCM store
sinking framework.  The LCM dataflow is also used by RTL PRE which
handles both loads and non-loads so in principle it should be able
to handle stores and non-stores for the sinking case (PRE on the
reverse CFG).

A global dataflow is more powerful than any local ad-hoc method.


IIRC you can use LCM on stores like this, but you have to run it 
independently on each store to pick up the secondary effects.   I 
believe the basic concepts are discussed in Morgan's book.   That may 
turn out to be too expensive in practice -- I've never tried it though.


jeff




Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-04-21 Thread Richard Biener
On Tue, 20 Apr 2021, Xionghu Luo wrote:

> 
> 
> On 2021/4/15 19:34, Richard Biener wrote:
> > On Thu, 15 Apr 2021, Xionghu Luo wrote:
> > 
> >> Thanks,
> >>
> >> On 2021/4/14 14:41, Richard Biener wrote:
>  "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
>  but it moves #538 first, then #235; there is a strong dependency here.  It
>  seems unlike the LCM framework, which could solve all of them and do the
>  delete-insert in one iteration.
> >>> So my question was whether we want to do both within the LCM store
> >>> sinking framework.  The LCM dataflow is also used by RTL PRE which
> >>> handles both loads and non-loads so in principle it should be able
> >>> to handle stores and non-stores for the sinking case (PRE on the
> >>> reverse CFG).
> >>>
> >>> A global dataflow is more powerful than any local ad-hoc method.
> >>
> >> My biggest concern is whether the LCM DF framework could support sinking
> >> *multiple* reverse-dependent non-store instructions together by *one*
> >> call of LCM DF.  If this is not supported, we need to run LCM multiple
> >> times until there are no new changes, which would obviously be time
> >> consuming (unless compile time is not important here).
> > 
> > As said it is used for PRE and there it most definitely can do that.
> 
> I did some investigation about PRE and attached a case to show how it
> works; it is quite like store-motion, and actually there is an rtl-hoist
> pass in gcse.c which only works for code size.  All of them leverage
> the LCM framework to move instructions upward or downward.
> 
> PRE and rtl-hoist move instructions upward: they analyze/hash the SOURCE
> exprs and call pre_edge_lcm.  Store-motion and rtl-sink move instructions
> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm.
> All four problems are converted to the LCM DF problem with
> n_basic_blocks * m_exprs bitmaps for the 4 matrices (antic, transp,
> avail, kill) as input and two outputs of where to insert/delete.
> 
> PRE scans each instruction and hashes the SRC into the table without
> *checking the relationship between instructions*.  For the case attached,
> BB 37, BB 38 and BB 41 all contain the SOURCE expr "r262:DI+r139:DI",
> but BB 37 and BB 41 save it to index 106 while BB 38 saves it to index
> 110.  After finishing this pass, "r262:DI+r139:DI" in BB 41 is replaced
> with "r194:DI=r452:DI", then exprs are inserted into BB 75~BB 80 to
> create full redundancies from partial redundancies, and finally the
> instruction in BB 37 is updated.

I'm not familiar with the actual PRE code but reading the toplevel comment
it seems that indeed it can only handle expressions contained in a single
insn unless a REG_EQUAL note provides a short-hand for the larger one.

That of course means it would need to mark things as not transparent
for correctness where they'd be if moved together.  Now, nothing
prevents you changing the granularity of what you feed LCM.

So originally we arrived at looking into LCM because there's already
a (store) sinking pass on RTL (using LCM) so adding another (loop-special)
one didn't look like the best obvious solution.

That said, LCM would work for single-instruction expressions.  
Alternatively a greedy algorithm like you prototyped could be used.
Another pass to look at would be RTL invariant motion which seems to
compute some kind of dependency graph - not sure if that would be
adaptable for the reverse CFG problem.
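As a rough illustration of the kind of per-expression bitmap dataflow that pre_edge_lcm / pre_edge_rev_lcm solve over (a toy fixed-point solve, not GCC code; the block names and single-expression bit layout are invented for the example -- for sinking, the same equations would run on the reverse CFG):

```python
# Toy illustration (not GCC code): one bit per (block, expr), iterated to
# a fixed point.  ANTIN/ANTOUT say where an expression is anticipated,
# which feeds the insert/delete decisions of LCM.
def solve_anticipatable(succs, antloc, transp, n_exprs):
    """succs: block -> list of successor blocks.
    antloc[b]/transp[b]: bitmasks; bit e set if expr e is locally
    anticipated in / transparent through block b."""
    full = (1 << n_exprs) - 1
    antin = {b: full for b in succs}   # optimistic initialization
    antout = {b: 0 for b in succs}
    changed = True
    while changed:
        changed = False
        for b in succs:
            # ANTOUT[b] = intersection of ANTIN over all successors
            out = full if succs[b] else 0
            for s in succs[b]:
                out &= antin[s]
            # ANTIN[b] = ANTLOC[b] | (ANTOUT[b] & TRANSP[b])
            newin = antloc[b] | (out & transp[b])
            if (newin, out) != (antin[b], antout[b]):
                antin[b], antout[b] = newin, out
                changed = True
    return antin, antout
```

On a diamond CFG where the expression appears in both branches, the solve marks it anticipated at the split point, which is exactly the "partial redundancy becomes full redundancy" information the pass needs.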

> Issues witnessed in the PRE run:
> 1) "r262:DI+r139:DI" in BLOCK 38 is not replaced;
> 2) PRE also uses a pseudo register to store the result, like store-motion
> and even rtl-hoist.  Actually we need a real "move" instead of a "replace"
> for rtl-sink to improve performance, though there is also a potential
> register pressure issue as with rtl-hoist?
> 3) The SRC instruction is scanned WITHOUT a back-chain check across
> instructions in hash_scan_set, so it couldn't handle the case below even
> though a+c is the same as b+d.  So I am afraid a single run couldn't solve
> the instruction dependency issue and sink multiple instructions out as
> before for rtl-sink.
> 
> BB1:
> a = b;
> c = d;
> s = a + c;
> BB2:
> s = b + d;
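The point in 3) can be sketched like this (a toy model, not hash_scan_set's actual code): keying expressions by their own shape misses equivalences that only hold through earlier copies, while mapping operands through known copies first (a crude form of value numbering) catches them.

```python
# Toy illustration: syntactic hashing vs. copy-aware hashing for the
# BB1/BB2 example above (a = b; c = d; so a+c and b+d compute the same).
def syntactic_key(expr):
    return expr  # gcse-style: the expression's own shape is the key

def value_key(expr, copies):
    lhs, rhs = expr.split('+')
    # look each operand up through known copies before building the key
    return copies.get(lhs, lhs) + '+' + copies.get(rhs, rhs)

copies_bb1 = {'a': 'b', 'c': 'd'}   # from BB1: a = b; c = d
assert syntactic_key('a+c') != syntactic_key('b+d')
assert value_key('a+c', copies_bb1) == value_key('b+d', {})  # both 'b+d'
```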
> 
> 
> gcse.c:
> changed = one_pre_gcse_pass ()
>   alloc_hash_table (&expr_hash_table);
>   compute_hash_table (&expr_hash_table);
>     compute_hash_table_work (table);
>       FOR_EACH_BB_FN (current_bb, cfun)
>         FOR_BB_INSNS (current_bb, insn)
>           hash_scan_insn (insn, table);
>             hash_scan_set (pat, insn, table);
>               if REG_P (dest)
>                 insert_expr_in_table (src, GET_MODE (dest), insn, antic_p,
>                                       avail_p, max_distance, table);
>                   hash = hash_expr (x, mode, &do_not_record_p, table->size);
>   dump_hash_table (dump_file, "Expression", &expr_hash_table);
>   edge_list = compute_pre_data ();
>     compute_local_properties (transp, comp, antloc, &expr_hash_table);
>     edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc,
>                               ae_kill, &pre_insert_map, &pre_delete_map);
>   changed |= pre_gcse (edge_list);
>     changed = pre_delete ();  /* Create a pseudo-reg to store the result of
>     reaching 

Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-04-15 Thread Richard Biener
On Thu, 15 Apr 2021, Xionghu Luo wrote:

> Thanks,
> 
> On 2021/4/14 14:41, Richard Biener wrote:
> >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
> >> but it moves #538 first, then #235; there is a strong dependency here.  It
> >> seems unlike the LCM framework, which could solve all of them and do the
> >> delete-insert in one iteration.
> > So my question was whether we want to do both within the LCM store
> > sinking framework.  The LCM dataflow is also used by RTL PRE which
> > handles both loads and non-loads so in principle it should be able
> > to handle stores and non-stores for the sinking case (PRE on the
> > reverse CFG).
> > 
> > A global dataflow is more powerful than any local ad-hoc method.
> 
> My biggest concern is whether the LCM DF framework could support sinking
> *multiple* reverse-dependent non-store instructions together by *one*
> call of LCM DF.  If this is not supported, we need to run LCM multiple
> times until there are no new changes, which would obviously be time
> consuming (unless compile time is not important here).

As said it is used for PRE and there it most definitely can do that.

> 
> > 
> > Richard.
> > 
> >> However, there are still some common methods that could be shared, like
> >> the def-use check (though store-motion is per bb, rtl-sink is per loop),
> >> insert_store, commit_edge_insertions, etc.
> >>
> >>
> >>508: L508:
> >>507: NOTE_INSN_BASIC_BLOCK 34
> >> 12: r139:DI=r140:DI
> >>REG_DEAD r140:DI
> >>240: L240:
> >>231: NOTE_INSN_BASIC_BLOCK 35
> >>232: r142:DI=zero_extend(r139:DI#0)
> >>233: r371:SI=r142:DI#0-0x1
> >>234: r243:DI=zero_extend(r371:SI)
> >>REG_DEAD r371:SI
> >>235: r452:DI=r262:DI+r139:DI
> >>538: r194:DI=r452:DI
> >>236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0)
> 
> 
> Like here, each instruction's dest reg is computed into the input vector
> bitmap; after solving the equations by calling pre_edge_rev_lcm, #538 is
> moved out of the loop by the first call, then #235 by a second call...
> 4 repeated calls are needed in total here.  Is the LCM framework smart
> enough to move all 4 instructions within one iteration?  I am worried
> that the input vector bitmap couldn't solve the dependency problem for
> two back-chained instructions.
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-04-15 Thread Xionghu Luo via Gcc-patches
Thanks,

On 2021/4/14 14:41, Richard Biener wrote:
>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
>> but it moves #538 first, then #235; there is a strong dependency here.  It
>> seems unlike the LCM framework, which could solve all of them and do the
>> delete-insert in one iteration.
> So my question was whether we want to do both within the LCM store
> sinking framework.  The LCM dataflow is also used by RTL PRE which
> handles both loads and non-loads so in principle it should be able
> to handle stores and non-stores for the sinking case (PRE on the
> reverse CFG).
> 
> A global dataflow is more powerful than any local ad-hoc method.

My biggest concern is whether the LCM DF framework could support sinking
*multiple* reverse-dependent non-store instructions together by *one*
call of LCM DF.  If this is not supported, we need to run LCM multiple
times until there are no new changes, which would obviously be time
consuming (unless compile time is not important here).

> 
> Richard.
> 
>> However, there are still some common methods that could be shared, like
>> the def-use check (though store-motion is per bb, rtl-sink is per loop),
>> insert_store, commit_edge_insertions, etc.
>>
>>
>>508: L508:
>>507: NOTE_INSN_BASIC_BLOCK 34
>> 12: r139:DI=r140:DI
>>REG_DEAD r140:DI
>>240: L240:
>>231: NOTE_INSN_BASIC_BLOCK 35
>>232: r142:DI=zero_extend(r139:DI#0)
>>233: r371:SI=r142:DI#0-0x1
>>234: r243:DI=zero_extend(r371:SI)
>>REG_DEAD r371:SI
>>235: r452:DI=r262:DI+r139:DI
>>538: r194:DI=r452:DI
>>236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0)


Like here, each instruction's dest reg is computed into the input vector
bitmap; after solving the equations by calling pre_edge_rev_lcm, #538 is
moved out of the loop by the first call, then #235 by a second call...
4 repeated calls are needed in total here.  Is the LCM framework smart
enough to move all 4 instructions within one iteration?  I am worried
that the input vector bitmap couldn't solve the dependency problem for
two back-chained instructions.
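The greedy alternative mentioned earlier in the thread can be sketched like this (toy code, not the posted patch; insn and register names follow the #232..#538 example above): visiting candidate insns in reverse program order sinks a whole dependency chain in a single pass, because by the time an insn is visited, the later insns that read its dest have already been moved out.

```python
# Toy sketch of greedy reverse-order sinking from a loop header block.
def sink_from_header(candidates, used_by_rest_of_loop):
    """candidates: [(dest_reg, [src_regs])] in program order.
    used_by_rest_of_loop: regs read by non-candidate insns in the loop
    (compares, branches, memory ops)."""
    remaining = list(candidates)
    sunk = []
    for insn in reversed(candidates):
        dest, _srcs = insn
        if dest in used_by_rest_of_loop:
            continue  # still read inside the loop: must stay
        if any(dest in other[1] for other in remaining if other is not insn):
            continue  # a not-yet-sunk candidate still reads it
        remaining.remove(insn)
        sunk.append(insn)
    sunk.reverse()  # re-emit at the loop exit in original order
    return remaining, sunk
```

Running it on a model of the example (#232 defines r142, which the compare #236 still reads; #233/#234/#235/#538 form the chain) sinks all four chained insns in one pass and keeps #232 in the loop.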


-- 
Thanks,
Xionghu


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-04-14 Thread Richard Biener
On Wed, 14 Apr 2021, Xionghu Luo wrote:

> Hi,
> 
> On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote:
> >> Also we already have a sinking pass on RTL which even computes
> >> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c.
> >> I'm not sure whether this deals with non-stores but the
> >> LCM machinery definitely can handle arbitrary expressions.  I wonder
> >> if it makes more sense to extend this rather than inventing a new
> >> ad-hoc sinking pass?
> >  From its description, my pass doesn't handle or process store instructions
> > like store-motion does.  Thanks, will check it.
> 
> Store motion only processes store instructions with data flow equations,
> generating 4 inputs (st_kill, st_avloc, st_antloc, st_transp) and solving
> them with the Lazy Code Motion API (5 DF compute calls) and 2 outputs
> (st_delete_map, st_insert_map) globally; each store place is independently
> represented in the input bitmap vectors.  The output says which stores
> should be deleted and where to insert.  The current code does what you
> said, "emit copies to a new pseudo at the original insn location and use
> it in the following bb"; actually it is "store replacement" instead of
> "store move", so why not save one pseudo by moving the store instruction
> to the target edge directly?

It probably simply saves the pass from doing analysis whether the
stored value is clobbered on the sinking path, enabling more store
sinking.  For stores that might be even beneficial, for non-stores
it becomes more of a cost issue, yes.

> There are many differences between the newly added rtl-sink pass and the
> store-motion pass.
> 1. Store motion moves only store instructions; rtl-sink ignores store
> instructions.
> 2. Store motion is global DF problem solving; rtl-sink only processes the
> loop header in reverse with a dependency check in the loop.  Take the RTL
> below as an example:
> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
> but it moves #538 first, then #235; there is a strong dependency here.  It
> seems unlike the LCM framework, which could solve all of them and do the
> delete-insert in one iteration.

So my question was whether we want to do both within the LCM store
sinking framework.  The LCM dataflow is also used by RTL PRE which
handles both loads and non-loads so in principle it should be able
to handle stores and non-stores for the sinking case (PRE on the
reverse CFG).

A global dataflow is more powerful than any local ad-hoc method.

Richard.

> However, there are still some common methods that could be shared, like
> the def-use check (though store-motion is per bb, rtl-sink is per loop),
> insert_store, commit_edge_insertions, etc.
> 
> 
>   508: L508:
>   507: NOTE_INSN_BASIC_BLOCK 34
>12: r139:DI=r140:DI
>   REG_DEAD r140:DI
>   240: L240:
>   231: NOTE_INSN_BASIC_BLOCK 35
>   232: r142:DI=zero_extend(r139:DI#0)
>   233: r371:SI=r142:DI#0-0x1
>   234: r243:DI=zero_extend(r371:SI)
>   REG_DEAD r371:SI
>   235: r452:DI=r262:DI+r139:DI
>   538: r194:DI=r452:DI
>   236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0)
>   237: pc={(geu(r372:CCUNS,0))?L246:pc}
>   REG_DEAD r372:CCUNS
>   REG_BR_PROB 59055804
>   238: NOTE_INSN_BASIC_BLOCK 36
>   239: r140:DI=r139:DI+0x1
>   241: r373:DI=r251:DI-0x1
>   242: r374:SI=zero_extend([r262:DI+r139:DI])
>   REG_DEAD r139:DI
>   243: r375:SI=zero_extend([r373:DI+r140:DI])
>   REG_DEAD r373:DI
>   244: r376:CC=cmp(r374:SI,r375:SI)
>   REG_DEAD r375:SI
>   REG_DEAD r374:SI
>   245: pc={(r376:CC==0)?L508:pc}
>   REG_DEAD r376:CC
>   REG_BR_PROB 1014686028
>   246: L246:
>   247: NOTE_INSN_BASIC_BLOCK 37
>   248: r377:SI=r142:DI#0-0x2
>   REG_DEAD r142:DI
>   249: r256:DI=zero_extend(r377:SI)
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-04-13 Thread Xionghu Luo via Gcc-patches
Hi,

On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote:
>> Also we already have a sinking pass on RTL which even computes
>> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c.
>> I'm not sure whether this deals with non-stores but the
>> LCM machinery definitely can handle arbitrary expressions.  I wonder
>> if it makes more sense to extend this rather than inventing a new
>> ad-hoc sinking pass?
>  From its description, my pass doesn't handle or process store instructions
> like store-motion does.  Thanks, will check it.

Store motion only processes store instructions with data flow equations,
generating 4 inputs (st_kill, st_avloc, st_antloc, st_transp) and solving
them with the Lazy Code Motion API (5 DF compute calls) and 2 outputs
(st_delete_map, st_insert_map) globally; each store place is independently
represented in the input bitmap vectors.  The output says which stores
should be deleted and where to insert.  The current code does what you
said, "emit copies to a new pseudo at the original insn location and use
it in the following bb"; actually it is "store replacement" instead of
"store move", so why not save one pseudo by moving the store instruction
to the target edge directly?
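The distinction being drawn can be modeled like this (a rough sketch of the assumed semantics, not store-motion.c's actual transform; instruction tuples and register names are invented for the example):

```python
# Toy model: "store replacement" (copy to a fresh pseudo at the old spot,
# store the pseudo on the exit edge) vs. "store move" (move the store to
# the exit edge outright, saving one pseudo).
def store_replacement(loop_body, store, fresh_pseudo):
    body = [ins if ins != store else (fresh_pseudo, '=', store[2])
            for ins in loop_body]              # keep a copy at the old spot
    exit_edge = [(store[0], '=', fresh_pseudo)]  # store the pseudo on exit
    return body, exit_edge

def store_move(loop_body, store):
    body = [ins for ins in loop_body if ins != store]
    exit_edge = [store]                        # one pseudo fewer
    return body, exit_edge
```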

There are many differences between the newly added rtl-sink pass and the
store-motion pass.
1. Store motion moves only store instructions; rtl-sink ignores store
instructions.
2. Store motion is global DF problem solving; rtl-sink only processes the
loop header in reverse with a dependency check in the loop.  Take the RTL
below as an example:
"#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
but it moves #538 first, then #235; there is a strong dependency here.  It
seems unlike the LCM framework, which could solve all of them and do the
delete-insert in one iteration.

However, there are still some common methods that could be shared, like
the def-use check (though store-motion is per bb, rtl-sink is per loop),
insert_store, commit_edge_insertions, etc.


  508: L508:
  507: NOTE_INSN_BASIC_BLOCK 34
   12: r139:DI=r140:DI
  REG_DEAD r140:DI
  240: L240:
  231: NOTE_INSN_BASIC_BLOCK 35
  232: r142:DI=zero_extend(r139:DI#0)
  233: r371:SI=r142:DI#0-0x1
  234: r243:DI=zero_extend(r371:SI)
  REG_DEAD r371:SI
  235: r452:DI=r262:DI+r139:DI
  538: r194:DI=r452:DI
  236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0)
  237: pc={(geu(r372:CCUNS,0))?L246:pc}
  REG_DEAD r372:CCUNS
  REG_BR_PROB 59055804
  238: NOTE_INSN_BASIC_BLOCK 36
  239: r140:DI=r139:DI+0x1
  241: r373:DI=r251:DI-0x1
  242: r374:SI=zero_extend([r262:DI+r139:DI])
  REG_DEAD r139:DI
  243: r375:SI=zero_extend([r373:DI+r140:DI])
  REG_DEAD r373:DI
  244: r376:CC=cmp(r374:SI,r375:SI)
  REG_DEAD r375:SI
  REG_DEAD r374:SI
  245: pc={(r376:CC==0)?L508:pc}
  REG_DEAD r376:CC
  REG_BR_PROB 1014686028
  246: L246:
  247: NOTE_INSN_BASIC_BLOCK 37
  248: r377:SI=r142:DI#0-0x2
  REG_DEAD r142:DI
  249: r256:DI=zero_extend(r377:SI)


-- 
Thanks,
Xionghu


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-03-26 Thread Xionghu Luo via Gcc-patches
Hi, sorry for late response,

On 2021/3/23 16:50, Richard Biener wrote:
>>> It definitely should be before uncprop (but context stops there). And yes,
>>> re-running passes isn't the very, very best thing to do without explaining
>>> it cannot be done in other ways. Not for late stage 3 anyway.
>>>
>>> Richard.
>>>
>> Thanks.  Also tried to implement this in a separate RTL pass; which
>> would be better?  I guess this would also be a stage1 issue...
> Yes, that's also for stage1 obviously.  Can you check how the number
> of sink opportunities of your RTL pass changes if you add the
> 2nd GIMPLE sinking pass?

Number of instructions sunk out of loops when running (no sink2 -> sink2):
1. SPEC2017 int: 371  ->  142
2. SPEC2017 float: 949  ->  343
3. bootstrap: 402  ->  229
4. stage1 libraries: 115   ->  68
5. regression tests: 4533 ->  2948

The cases I used from the beginning could all be optimized by gimple sink2
instead, but there are still many instructions sunk even when gimple sink2
is added; I guess most of them are produced by the expand pass.  One
example follows (it was after #38 in 262r.reginfo; note that a new block
is created between exit->src and exit->dst to avoid other bbs jumping into
exit->dst and causing execution errors due to r132:DI being updated
unexpectedly).  Sometimes an extra zero extend in a loop like this can
cause a serious performance issue:

vect-live-4.ltrans0.ltrans.263r.sink:
...
Loop 2: sinking (set (reg/v:DI 132 [  ])
(sign_extend:DI (reg:SI 144))) from bb 7 to bb 11

...

   44: L44:
   35: NOTE_INSN_BASIC_BLOCK 7
   36: r127:DI=r127:DI+0x4
   37: r145:SI=[r127:DI]
   38: r144:SI=r145:SI+0x5
  REG_DEAD r145:SI
   40: r125:DI=r125:DI+0x4
   41: [r125:DI]=r144:SI
  REG_DEAD r144:SI
   42: r146:SI=r126:DI#0-0x1
  REG_DEAD r126:DI
   43: r126:DI=zero_extend(r146:SI)
  REG_DEAD r146:SI
   45: r147:CC=cmp(r126:DI,0)
   46: pc={(r147:CC!=0)?L65:pc}
  REG_DEAD r147:CC
  REG_BR_PROB 941032164
   68: NOTE_INSN_BASIC_BLOCK 11
   69: r132:DI=sign_extend(r144:SI)
  ; pc falls through to BB 8
   65: L65:
   64: NOTE_INSN_BASIC_BLOCK 9
  ; pc falls through to BB 7
   47: L47:
   48: NOTE_INSN_BASIC_BLOCK 8

> 
> I'll note that you are not sinking later stmts first (you only
> walk insns reverse but not BBs).  GIMPLE sinking performs a
> domwalk over post dominators (but it has an easier job because
> of PHIs).  I guess you'd want to walk starting from loop exit
> blocks (from innermost loops as you do) in reverse program order.

Yes, this rtl-sink can only sink instructions from the *loop header* to
the loop exit blocks in reverse order; it is a bit strict, since this is
the first step to see whether it is reasonable to add such a pass.
For example, if the instruction is in a loop body block, sinking it out
will sometimes cause execution errors, as there is no information about
whether the loop body block will be executed or not: if the loop jumps
from the header directly to the exit, the instructions sunk from body to
exit will change register values unexpectedly.  It seems always_reached
and always_executed in loop-invariant.c could be reused here to detect
such a circumstance?
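One conservative form of that check can be sketched as follows (a toy, not loop-invariant.c's implementation; the dominator sets and block names are invented for the example): a loop block is safe to sink from only if every path that leaves the loop passes through it.

```python
# Toy sketch: a loop block is "always executed" per iteration reaching an
# exit if it dominates the source block of every exit edge.  Sinking from
# a block that fails this test could clobber a register on iterations
# that skip the block.
def always_executed(block, exit_edges, dominators):
    """dominators[b]: set of blocks that dominate b.
    exit_edges: [(src_block, dst_block)] leaving the loop."""
    return all(block in dominators[src] for src, _dst in exit_edges)
```

For a header/body/latch loop where the body can be bypassed, the header passes the test and the body does not, matching the header-only restriction described above.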

> 
> I'll also note that you don't have to check whether stmts you
> want to sink have their uses set after it - you can emit copies
> to a new pseudo at the original insn location and use those
> after the loop (that of course comes at some cost).

Not sure whether I understood your point correctly, but the instruction
would still be executed in the loop niter times?
What I am trying to do is move r132:DI=sign_extend(r144:SI) out of the
loop: if it was executed in the loop 100 times, r132 is not used in the
loop, and r144 is not updated after the current instruction, then moving
it to the loop exit makes it execute only once.

> 
> Also we already have a sinking pass on RTL which even computes
> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c.
> I'm not sure whether this deals with non-stores but the
> LCM machinery definitely can handle arbitrary expressions.  I wonder
> if it makes more sense to extend this rather than inventing a new
> ad-hoc sinking pass?

From the name, my pass doesn't handle or process store instructions
like store-motion does.  Thanks, will check it.

-- 
Thanks,
Xionghu


Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-03-23 Thread Richard Biener
On Tue, 23 Mar 2021, Xionghu Luo wrote:

> 
> 
> On 2020/12/23 00:53, Richard Biener wrote:
> > On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo
> >  wrote:
> >> Here comes another case that requires running a pass once more.  As
> >> this is not the commonly suggested direction to solve problems, not
> >> quite sure whether it is still a reasonable fix here.  Source code
> >> is something like:
> >>
> >> ref = ip + *hslot;
> >> while (ip < in_end - 2) {
> >>   unsigned int len = 2;
> >>   len++;
> >> for ()   {
> >>   do len++;
> >>   while (len < maxlen && ref[len] == ip[len]); //sink code here.
> >>   break;
> >> }
> >>   len -= 2;
> >>   ip++;
> >>   ip += len + 1;
> >>   if (ip >= in_end - 2)
> >> break;
> >> }
> >>
> >> Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1:
> >>
> >>[local count: 75120046]:
> >>   # len_160 = PHI 
> >>   len_189 = len_160 + 1;
> >>   _423 = (sizetype) len_189;
> >>   _424 = ip_229 + _423;
> >>   if (maxlen_186 > len_189)
> >> goto ; [94.50%]
> >>   else
> >> goto ; [5.50%]
> >>
> >>[local count: 70988443]:
> >>   _84 = *_424;
> >>   _86 = ref_182 + _423;
> >>   _87 = *_86;
> >>   if (_84 == _87)
> >> goto ; [94.50%]
> >>   else
> >> goto ; [5.50%]
> >>
> >>[local count: 67084079]:
> >>   goto ; [100.00%]
> >>
> >>[local count: 14847855]:
> >>   # len_263 = PHI 
> >>   # _262 = PHI <_423(32), _423(31)>
> >>   # _264 = PHI <_424(32), _424(31)>
> >>   len_190 = len_263 + 4294967295;
> >>   if (len_190 <= 6)
> >> goto ; [0.00%]
> >>   else
> >> goto ; [100.00%]
> >>
> >> Then in ivopts, instructions are updated to xxx.c.174t.ivopts:
> >>
> >>[local count: 75120046]:
> >>   # ivtmp.30_29 = PHI 
> >>   _34 = (unsigned int) ivtmp.30_29;
> >>   len_160 = _34 + 4294967295;
> >>   _423 = ivtmp.30_29;
> >>   _35 = (unsigned long) ip_229;
> >>   _420 = ivtmp.30_29 + _35;
> >>   _419 = (uint8_t *) _420;
> >>   _424 = _419;
> >>   len_418 = (unsigned int) ivtmp.30_29;
> >>   if (maxlen_186 > len_418)
> >> goto ; [94.50%]
> >>   else
> >> goto ; [5.50%]
> >>
> >>[local count: 70988443]:
> >>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
> >>   ivtmp.30_31 = ivtmp.30_29 + 1;
> >>   _417 = ref_182 + 18446744073709551615;
> >>   _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
> >>   if (_84 == _87)
> >> goto ; [94.50%]
> >>   else
> >> goto ; [5.50%]
> >>
> >>[local count: 67084079]:
> >>   goto ; [100.00%]
> >>
> >>[local count: 14847855]:
> >>   # len_263 = PHI 
> >>   # _262 = PHI <_423(32), _423(31)>
> >>   # _264 = PHI <_424(32), _424(31)>
> >>   len_190 = len_263 + 4294967295;
> >>   if (len_190 <= 6)
> >> goto ; [0.00%]
> >>   else
> >> goto ; [100.00%]
> >>
> >> Some instructions in BB 31 are not used in the loop and could be sunk
> >> out of the loop to reduce the computation, but they are not sunk by
> >> any later pass.  Running the sink_code pass once more, at least after
> >> fre5, improves this typical case's performance by 23% due to fewer
> >> instructions executed in the loop.
> >> xxx.c.209t.sink2:
> >>
> >> Sinking _419 = (uint8_t *) _420;
> >> from bb 31 to bb 89
> >> Sinking _420 = ivtmp.30_29 + _35;
> >> from bb 31 to bb 89
> >> Sinking _35 = (unsigned long) ip_229;
> >> from bb 31 to bb 89
> >> Sinking len_160 = _34 + 4294967295;
> >> from bb 31 to bb 33
> >>
> >> I also tested SPEC2017 performance on P8LE: 544.nab_r is improved
> >> by 2.43%, with no big changes to the other cases; GEOMEAN is improved
> >> only slightly, by 0.25%.
> >>
> >> The reason it should run after fre5 is that fre does some PHI
> >> optimization that exposes the opportunity.  The patch puts it after
> >> pass_modref based on my guess that some gimple optimizations like
> >> thread_jumps, dse, dce etc. could provide more opportunities for
> >> sinking code.  Not sure it is the correct place to put it.  I also
> >> verified this issue exists on both X86 and ARM64.
> >> Any comments?  Thanks.
> > 
> > It definitely should be before uncprop (but context stops there). And yes,
> > re-running passes isn't the very, very best thing to do without explaining
> > it cannot be done in other ways. Not for late stage 3 anyway.
> > 
> > Richard.
> > 
> 
> Thanks.  I also tried to implement this in a separate RTL pass; which
> would be better?  I guess this would also be a stage1 issue...

Yes, that's also for stage1 obviously.  Can you check how the number
of sink opportunities of your RTL pass changes if you add the
2nd GIMPLE sinking pass?

I'll note that you are not sinking later stmts first (you only
walk insns reverse but not BBs).  GIMPLE sinking performs a
domwalk over post dominators (but it has an easier job because
of PHIs).  I guess you'd want to walk starting from loop exit
blocks (from innermost loops as you do) in reverse program order.

I'll also note that you don't have to check whether stmts you
want to sink have their uses set after it - you can emit copies
to a new pseudo 

Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-03-22 Thread Xionghu Luo via Gcc-patches



On 2020/12/23 00:53, Richard Biener wrote:

On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo  
wrote:

Here comes another case that requires running a pass once more.  As this
is not the commonly suggested direction to solve problems, not quite sure
whether it is still a reasonable fix here.  Source code is something
like:

ref = ip + *hslot;
while (ip < in_end - 2) {
  unsigned int len = 2;
  len++;
for ()   {
  do len++;
  while (len < maxlen && ref[len] == ip[len]); //sink code here.
  break;
}
  len -= 2;
  ip++;
  ip += len + 1;
  if (ip >= in_end - 2)
break;
}

Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1:

   [local count: 75120046]:
  # len_160 = PHI 
  len_189 = len_160 + 1;
  _423 = (sizetype) len_189;
  _424 = ip_229 + _423;
  if (maxlen_186 > len_189)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 70988443]:
  _84 = *_424;
  _86 = ref_182 + _423;
  _87 = *_86;
  if (_84 == _87)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 67084079]:
  goto ; [100.00%]

   [local count: 14847855]:
  # len_263 = PHI 
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
goto ; [0.00%]
  else
goto ; [100.00%]

Then in ivopts, instructions are updated to xxx.c.174t.ivopts:

   [local count: 75120046]:
  # ivtmp.30_29 = PHI 
  _34 = (unsigned int) ivtmp.30_29;
  len_160 = _34 + 4294967295;
  _423 = ivtmp.30_29;
  _35 = (unsigned long) ip_229;
  _420 = ivtmp.30_29 + _35;
  _419 = (uint8_t *) _420;
  _424 = _419;
  len_418 = (unsigned int) ivtmp.30_29;
  if (maxlen_186 > len_418)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 70988443]:
  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
  ivtmp.30_31 = ivtmp.30_29 + 1;
  _417 = ref_182 + 18446744073709551615;
  _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
  if (_84 == _87)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 67084079]:
  goto ; [100.00%]

   [local count: 14847855]:
  # len_263 = PHI 
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
goto ; [0.00%]
  else
goto ; [100.00%]

Some instructions in BB 31 are not used in the loop and could be sunk
out of the loop to reduce the computation, but they are not sunk by any
later pass.  Running the sink_code pass once more, at least after fre5,
improves this typical case's performance by 23% due to fewer
instructions executed in the loop.
xxx.c.209t.sink2:

Sinking _419 = (uint8_t *) _420;
from bb 31 to bb 89
Sinking _420 = ivtmp.30_29 + _35;
from bb 31 to bb 89
Sinking _35 = (unsigned long) ip_229;
from bb 31 to bb 89
Sinking len_160 = _34 + 4294967295;
from bb 31 to bb 33

I also tested SPEC2017 performance on P8LE: 544.nab_r is improved by
2.43%, with no big changes to the other cases; GEOMEAN is improved only
slightly, by 0.25%.

The reason it should run after fre5 is that fre does some PHI
optimization that exposes the opportunity.  The patch puts it after
pass_modref based on my guess that some gimple optimizations like
thread_jumps, dse, dce etc. could provide more opportunities for
sinking code.  Not sure it is the correct place to put it.  I also
verified this issue exists on both X86 and ARM64.
Any comments?  Thanks.


It definitely should be before uncprop (but context stops there). And yes, 
re-running passes isn't the very, very best thing to do without explaining it 
cannot be done in other ways. Not for late stage 3 anyway.

Richard.



Thanks.  I also tried to implement this in a separate RTL pass; which
would be better?  I guess this would also be a stage1 issue...


Xionghu
From ac6e161a592ea259f2807f40d1021ccbfc3a965f Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo 
Date: Thu, 4 Mar 2021 05:05:19 -0600
Subject: [PATCH] RTL loop sink pass

This is an RTL loop sink pass that checks loop header instructions: if
an instruction's dest is not used in the loop and its sources are not
updated after the instruction, then its result is computed on every
iteration but never used in the loop, so sinking it out of the loop can
reduce the executed instruction count theoretically (register pressure
is not considered yet).

Number of instructions sunk out of the loop when running on P8LE:
1. SPEC2017 int: 371
2. SPEC2017 float: 949
3. bootstrap: 402
4. stage1 libraries: 115
5. regression tests: 4533

Though there is no obvious performance change for SPEC2017, rtl-sink-2.c
achieves a 10% performance improvement by sinking 4 instructions out of
the loop header.  This is a bit like running the gimple sink pass twice,
which I posted several months ago; is implementing this in RTL better?

https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html

gcc/ChangeLog:

* Makefile.in:
* dbgcnt.def (DEBUG_COUNTER):
* passes.def:
* timevar.def (TV_TREE_SINK):
(TV_SINK):
* tree-pass.h (make_pass_rtl_sink):
* sink.c: New file.

gcc/testsuite/ChangeLog:

Re: [RFC] Run pass_sink_code once more after ivopts/fre

2020-12-22 Thread Richard Biener
On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo  
wrote:
>Here comes another case that requires running a pass once more.  As this
>is not the commonly suggested direction to solve problems, not quite
>sure whether it is still a reasonable fix here.  Source code is
>something like:
>
>ref = ip + *hslot;
>while (ip < in_end - 2) {
>  unsigned int len = 2;
>  len++;
>for ()   {
>  do len++;
>  while (len < maxlen && ref[len] == ip[len]); //sink code here.
>  break;
>}
>  len -= 2;
>  ip++;
>  ip += len + 1;
>  if (ip >= in_end - 2)
>break;
>}
>
>Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1:
>
>   [local count: 75120046]:
>  # len_160 = PHI 
>  len_189 = len_160 + 1;
>  _423 = (sizetype) len_189;
>  _424 = ip_229 + _423;
>  if (maxlen_186 > len_189)
>goto ; [94.50%]
>  else
>goto ; [5.50%]
>
>   [local count: 70988443]:
>  _84 = *_424;
>  _86 = ref_182 + _423;
>  _87 = *_86;
>  if (_84 == _87)
>goto ; [94.50%]
>  else
>goto ; [5.50%]
>
>   [local count: 67084079]:
>  goto ; [100.00%]
>
>   [local count: 14847855]:
>  # len_263 = PHI 
>  # _262 = PHI <_423(32), _423(31)>
>  # _264 = PHI <_424(32), _424(31)>
>  len_190 = len_263 + 4294967295;
>  if (len_190 <= 6)
>goto ; [0.00%]
>  else
>goto ; [100.00%]
>
>Then in ivopts, instructions are updated to xxx.c.174t.ivopts:
>
>   [local count: 75120046]:
>  # ivtmp.30_29 = PHI 
>  _34 = (unsigned int) ivtmp.30_29;
>  len_160 = _34 + 4294967295;
>  _423 = ivtmp.30_29;
>  _35 = (unsigned long) ip_229;
>  _420 = ivtmp.30_29 + _35;
>  _419 = (uint8_t *) _420;
>  _424 = _419;
>  len_418 = (unsigned int) ivtmp.30_29;
>  if (maxlen_186 > len_418)
>goto ; [94.50%]
>  else
>goto ; [5.50%]
>
>   [local count: 70988443]:
>  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
>  ivtmp.30_31 = ivtmp.30_29 + 1;
>  _417 = ref_182 + 18446744073709551615;
>  _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
>  if (_84 == _87)
>goto ; [94.50%]
>  else
>goto ; [5.50%]
>
>   [local count: 67084079]:
>  goto ; [100.00%]
>
>   [local count: 14847855]:
>  # len_263 = PHI 
>  # _262 = PHI <_423(32), _423(31)>
>  # _264 = PHI <_424(32), _424(31)>
>  len_190 = len_263 + 4294967295;
>  if (len_190 <= 6)
>goto ; [0.00%]
>  else
>goto ; [100.00%]
>
>Some instructions in BB 31 are not used in the loop and could be sunk
>out of the loop to reduce the computation, but they are not sunk by any
>later pass.  Running the sink_code pass once more, at least after fre5,
>improves this typical case's performance by 23% due to fewer
>instructions executed in the loop.
>xxx.c.209t.sink2:
>
>Sinking _419 = (uint8_t *) _420;
> from bb 31 to bb 89
>Sinking _420 = ivtmp.30_29 + _35;
> from bb 31 to bb 89
>Sinking _35 = (unsigned long) ip_229;
> from bb 31 to bb 89
>Sinking len_160 = _34 + 4294967295;
> from bb 31 to bb 33
>
>I also tested SPEC2017 performance on P8LE: 544.nab_r is improved by
>2.43%, with no big changes to the other cases; GEOMEAN is improved only
>slightly, by 0.25%.
>
>The reason it should run after fre5 is that fre does some PHI
>optimization that exposes the opportunity.  The patch puts it after
>pass_modref based on my guess that some gimple optimizations like
>thread_jumps, dse, dce etc. could provide more opportunities for
>sinking code.  Not sure it is the correct place to put it.  I also
>verified this issue exists on both X86 and ARM64.
>Any comments?  Thanks.

It definitely should be before uncprop (but context stops there). And yes, 
re-running passes isn't the very, very best thing to do without explaining it 
cannot be done in other ways. Not for late stage 3 anyway. 

Richard. 

>---
> gcc/passes.def  | 1 +
> gcc/tree-ssa-sink.c | 1 +
> 2 files changed, 2 insertions(+)
>
>diff --git a/gcc/passes.def b/gcc/passes.def
>index 21b2e2af0f7..69106615729 100644
>--- a/gcc/passes.def
>+++ b/gcc/passes.def
>@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
>   NEXT_PASS (pass_uncprop);
>   NEXT_PASS (pass_local_pure_const);
>   NEXT_PASS (pass_modref);
>+  NEXT_PASS (pass_sink_code);
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_all_optimizations_g);
>   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
>diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
>index b0abf4147d6..824659f3919 100644
>--- a/gcc/tree-ssa-sink.c
>+++ b/gcc/tree-ssa-sink.c
>@@ -819,6 +819,7 @@ public:
>   /* opt_pass methods: */
>   virtual bool gate (function *) { return flag_tree_sink != 0; }
>   virtual unsigned int execute (function *);
>+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
> 
> }; // class pass_sink_code
> 



[RFC] Run pass_sink_code once more after ivopts/fre

2020-12-21 Thread Xiong Hu Luo via Gcc-patches
Here comes another case that requires running a pass once more.  As this
is not the commonly suggested direction to solve problems, not quite sure
whether it is still a reasonable fix here.  Source code is something like:

ref = ip + *hslot;
while (ip < in_end - 2) {
  unsigned int len = 2;
  len++;
for ()   {
  do len++;
  while (len < maxlen && ref[len] == ip[len]); //sink code here.
  break;
}
  len -= 2;
  ip++;
  ip += len + 1;
  if (ip >= in_end - 2)
break;
}

Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1:

   [local count: 75120046]:
  # len_160 = PHI 
  len_189 = len_160 + 1;
  _423 = (sizetype) len_189;
  _424 = ip_229 + _423;
  if (maxlen_186 > len_189)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 70988443]:
  _84 = *_424;
  _86 = ref_182 + _423;
  _87 = *_86;
  if (_84 == _87)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 67084079]:
  goto ; [100.00%]

   [local count: 14847855]:
  # len_263 = PHI 
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
goto ; [0.00%]
  else
goto ; [100.00%]

Then in ivopts, instructions are updated to xxx.c.174t.ivopts:

   [local count: 75120046]:
  # ivtmp.30_29 = PHI 
  _34 = (unsigned int) ivtmp.30_29;
  len_160 = _34 + 4294967295;
  _423 = ivtmp.30_29;
  _35 = (unsigned long) ip_229;
  _420 = ivtmp.30_29 + _35;
  _419 = (uint8_t *) _420;
  _424 = _419;
  len_418 = (unsigned int) ivtmp.30_29;
  if (maxlen_186 > len_418)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 70988443]:
  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
  ivtmp.30_31 = ivtmp.30_29 + 1;
  _417 = ref_182 + 18446744073709551615;
  _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
  if (_84 == _87)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 67084079]:
  goto ; [100.00%]

   [local count: 14847855]:
  # len_263 = PHI 
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
goto ; [0.00%]
  else
goto ; [100.00%]

Some instructions in BB 31 are not used in the loop and could be sunk
out of the loop to reduce the computation, but they are not sunk by any
later pass.  Running the sink_code pass once more, at least after fre5,
improves this typical case's performance by 23% due to fewer
instructions executed in the loop.
xxx.c.209t.sink2:

Sinking _419 = (uint8_t *) _420;
 from bb 31 to bb 89
Sinking _420 = ivtmp.30_29 + _35;
 from bb 31 to bb 89
Sinking _35 = (unsigned long) ip_229;
 from bb 31 to bb 89
Sinking len_160 = _34 + 4294967295;
 from bb 31 to bb 33

I also tested SPEC2017 performance on P8LE: 544.nab_r is improved by
2.43%, with no big changes to the other cases; GEOMEAN is improved only
slightly, by 0.25%.

The reason it should run after fre5 is that fre does some PHI
optimization that exposes the opportunity.  The patch puts it after
pass_modref based on my guess that some gimple optimizations like
thread_jumps, dse, dce etc. could provide more opportunities for
sinking code.  Not sure it is the correct place to put it.  I also
verified this issue exists on both X86 and ARM64.
Any comments?  Thanks.
---
 gcc/passes.def  | 1 +
 gcc/tree-ssa-sink.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 21b2e2af0f7..69106615729 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_uncprop);
   NEXT_PASS (pass_local_pure_const);
   NEXT_PASS (pass_modref);
+  NEXT_PASS (pass_sink_code);
   POP_INSERT_PASSES ()
   NEXT_PASS (pass_all_optimizations_g);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
index b0abf4147d6..824659f3919 100644
--- a/gcc/tree-ssa-sink.c
+++ b/gcc/tree-ssa-sink.c
@@ -819,6 +819,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_sink != 0; }
   virtual unsigned int execute (function *);
+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
 
 }; // class pass_sink_code
 
-- 
2.27.0.90.geebb51ba8c