[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #16 from luoxhu at gcc dot gnu.org ---
The attached files are all built with -mcpu=power8, and the test case also fails
on P8LE.
I also verified that the code produces the expected output on P8BE. (The
'Aborted' there is caused by BE returning 0x41 instead of the 0x98 that LE
returns.)
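
The 0x41 versus 0x98 difference follows from byte order: the test copies the first in-memory byte of element 0 (0x41fcef98) and compares it with 152 (0x98). A minimal sketch, not taken from the bug report, that models both layouts without depending on the host's endianness (the helper name `first_byte_in_memory` is hypothetical):

```cpp
#include <cstdint>

// Hypothetical helper: returns the byte that sits at the lowest address
// when a 32-bit value is stored with the given byte order.
constexpr std::uint8_t first_byte_in_memory(std::uint32_t value, bool little_endian)
{
    return little_endian
               ? static_cast<std::uint8_t>(value & 0xffu)  // least significant byte first
               : static_cast<std::uint8_t>(value >> 24);   // most significant byte first
}

static_assert(first_byte_in_memory(0x41fcef98u, true) == 0x98, "LE stores 0x98 first");
static_assert(first_byte_in_memory(0x41fcef98u, false) == 0x41, "BE stores 0x41 first");
```

So a correct BE build is expected to fail the `res[0] != 152` check, while a correct LE build should pass it.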

P8LE :

luoxhu@gcc135 build $ ./q.bad
B0: 0, 0,0,0
Aborted

P8BE:
luoxhu@gcc203:~/workspace/build$ ./q.bad
B0: 41fcef98, 91648e8b,7dca18c6,61707865
Aborted


P8BE seems to generate better code with the patch:

luoxhu@gcc203:~/workspace/build$ diff q.good.S q.bad.S -U5
--- q.good.S 2022-07-26 09:19:32.487216946 +0300
+++ q.bad.S 2022-07-26 09:15:58.006770996 +0300
@@ -1,6 +1,7 @@
.file   "q.C"
+   .machine power8
.section ".text"
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
 .LC0:
.string "B0: %x, %x,%x,%x\n"
@@ -24,19 +25,17 @@
.cfi_def_cfa_offset 128
.cfi_offset 65, 16
.cfi_offset 30, -16
.cfi_offset 31, -8
mr %r30,%r3
-   vmrghw %v2,%v2,%v4
-   vmrghw %v5,%v3,%v5
-   vmrghw %v5,%v2,%v5
-   vspltw %v0,%v5,3
+   vspltw %v0,%v5,0
mfvsrwz %r7,%vs32
-   vspltw %v0,%v5,2
+   vspltw %v0,%v4,0
mfvsrwz %r6,%vs32
-   mfvsrwz %r5,%vs37
-   vspltw %v0,%v5,0
+   vspltw %v0,%v3,0
+   mfvsrwz %r5,%vs32
+   vspltw %v0,%v2,0
mfvsrwz %r31,%vs32
rldicl %r7,%r7,0,32
rldicl %r6,%r6,0,32
rldicl %r5,%r5,0,32
rldicl %r4,%r31,0,32
@@ -169,6 +168,6 @@
.set .LANCHOR1,. + 0
.type   res, @object
.size   res, 1
 res:
.zero   1
-   .ident  "GCC: (Debian 9.5.0-1) 9.5.0"
+   .ident  "GCC: (GNU) 13.0.0 20220726 (experimental)"

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #15 from luoxhu at gcc dot gnu.org ---
In combine, the vec_select(vec_concat(...)) and the following vec_select are
combined into a single extract instruction, which seems reasonable for both LE
and BE?

R146:   0 1 2 3
R141:   4 5 6 7
R150:   2 6 3 7   // vec_select(vec_concat(r146:V4SI,r141:V4SI),[2 6 3 7])
R151:   R150[3]   // vec_select(r150:V4SI,3)

=> 

R151:   R141[3]   //  vec_select(r141:V4SI,3)
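
The index arithmetic behind this simplification can be sketched in plain C++. This is only a model of the RTL element numbering, assuming elements 0..3 of the vec_concat come from the first operand and 4..7 from the second; it says nothing about how the hardware lanes map on LE versus BE. The names `V4`, `Sel`, `concat_elt` and `select_concat` are hypothetical.

```cpp
#include <array>
#include <cstddef>

using V4 = std::array<unsigned, 4>;
using Sel = std::array<std::size_t, 4>;

// Element idx of vec_concat(a, b): 0..3 come from a, 4..7 come from b.
constexpr unsigned concat_elt(const V4 &a, const V4 &b, std::size_t idx)
{
    return idx < 4 ? a[idx] : b[idx - 4];
}

// Model of vec_select(vec_concat(a, b), sel).
constexpr V4 select_concat(const V4 &a, const V4 &b, const Sel &sel)
{
    return V4{concat_elt(a, b, sel[0]), concat_elt(a, b, sel[1]),
              concat_elt(a, b, sel[2]), concat_elt(a, b, sel[3])};
}

// Label each element with its position in the concat, as in the table above.
constexpr V4 r146{0, 1, 2, 3};
constexpr V4 r141{4, 5, 6, 7};
constexpr V4 r150 = select_concat(r146, r141, Sel{2, 6, 3, 7});

// Selecting element 3 of r150 yields concat element 7, which is r141[3].
static_assert(r150[3] == r141[3], "vec_select(r150, 3) == vec_select(r141, 3)");
```

In this model the combined form is equivalent at the RTL level, which suggests the wrong code comes from how the resulting extract is later expanded rather than from the index math itself.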



Trying 21 -> 24:
   21: r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
  REG_DEAD r146:V4SI
  REG_DEAD r141:V4SI
   24: {r151:SI=vec_select(r150:V4SI,parallel);clobber scratch;}
Failed to match this instruction:
(parallel [
(set (reg:SI 151)
(vec_select:SI (reg:V4SI 141)
(parallel [
(const_int 3 [0x3])
])))
(clobber (scratch:SI))
(set (reg:V4SI 150)
(vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
(reg:V4SI 141))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
])))
])
Failed to match this instruction:
(parallel [
(set (reg:SI 151)
(vec_select:SI (reg:V4SI 141)
(parallel [
(const_int 3 [0x3])
])))
(set (reg:V4SI 150)
(vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
(reg:V4SI 141))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
])))
])
Successfully matched this instruction:
(set (reg:V4SI 150)
(vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
(reg:V4SI 141))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
])))
Successfully matched this instruction:
(set (reg:SI 151)
(vec_select:SI (reg:V4SI 141)
(parallel [
(const_int 3 [0x3])
])))
allowing combination of insns 21 and 24
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2    21:
r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
  REG_DEAD r146:V4SI
deferring rescan insn with uid = 21.
modifying insn i3    24: {r151:SI=vec_select(r141:V4SI,parallel);clobber
scratch;}
  REG_DEAD r141:V4SI
deferring rescan insn with uid = 24.


I guess the previous unspec implementation bypassed the LE + LE swap check, so
now in split2, we should generate vextuwlx instead of vextuwrx on little
endian?

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #14 from luoxhu at gcc dot gnu.org ---
Created attachment 53354
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53354&action=edit
split2

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #13 from luoxhu at gcc dot gnu.org ---
Created attachment 53353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53353&action=edit
after combine

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Created attachment 53352
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53352&action=edit
combine

[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #5 from luoxhu at gcc dot gnu.org ---
r12-6086

[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #4 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
> well.  -fopt-info doesn't reveal anything interesting besides
> 
> -fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 32987933)
> +fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 129072791)
> 
> obviously the slowdown is in P7Viterbi.  There's only minimal changes on the
> GIMPLE side, one notable:
> 
>   niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729 |   _2041 =
> niters.203_438 & 3;
>   _2408 = (int) niters_vector_mult_vf.205_2406;   |   if (_2041
> == 0)
>   tmp.206_2407 = k_384 + _2408;   | goto  66>; [25.00%]
>   _2300 = niters.203_442 & 3; <
>   if (_2300 == 0) <
> goto ; [25.00%]<
>   elseelse
> goto ; [75.00%]  goto  36>; [75.00%]
> 
>[local count: 41646173]:|   
> [local count: 177683003]:
>   # k_2403 = PHI  |  
> niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
>   # DEBUG k => k_2403 |   _2411 =
> (int) niters_vector_mult_vf.205_2409;
>   >  
> tmp.206_2410 = k_382 + _2411;
>   >
>   >   
> [local count: 162950122]:
>   >   # k_2406 =
> PHI 
> 
> the sink pass now does the transform where it did not do so before.
> 
> That's apparently because of
> 
>   /* If BEST_BB is at the same nesting level, then require it to have
>  significantly lower execution frequency to avoid gratuitous movement. 
> */
>   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>   /* If result of comparsion is unknown, prefer EARLY_BB.
>  Thus use !(...>=..) rather than (...<...)  */
>   && !(best_bb->count * 100 >= early_bb->count * threshold))
> return best_bb;
> 
>   /* No better block found, so return EARLY_BB, which happens to be the
>  statement's original block.  */
>   return early_bb;
> 
> where the SRC count is 96726596 before, 236910671 after and the
> destination count is 72544947 before, 177683003 at the destination after.
> The edge probabilities are 75% vs 25% and param_sink_frequency_threshold
> is exactly 75 as well.  Since 236910671*0.75
> is rounded down it passes the test while the previous state has an exact
> match defeating it.
> 
> It's a little bit of an arbitrary choice,
> 
> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
> index 2e744d6ae50..9b368e13463 100644
> --- a/gcc/tree-ssa-sink.cc
> +++ b/gcc/tree-ssa-sink.cc
> @@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
>if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>/* If result of comparsion is unknown, prefer EARLY_BB.
>  Thus use !(...>=..) rather than (...<...)  */
> -  && !(best_bb->count * 100 >= early_bb->count * threshold))
> +  && !(best_bb->count * 100 > early_bb->count * threshold))
>  return best_bb;
>  
>/* No better block found, so return EARLY_BB, which happens to be the
> 
> fixes the missed sinking but not the regression :/
> 
> The count differences start to appear when LC PHI blocks are added
> only for virtuals and then pre-existing 'Invalid sum of incoming counts'
> eventually lead to mismatches.  The 'Invalid sum of incoming counts'
> start with the loop splitting pass.
> 
> fast_algorithms.c:145:10: optimized: loop split
> 
> Xionghu Lou did profile count updates there, not sure if that made things
> worse in this case.
> 
> At least with broken BB counts splitting/unsplitting an edge can propagate
> bogus counts elsewhere it seems.

:( Could you please try reverting cd5ae148c47c6dee05adb19acd6a523f7187be7f and
see whether the performance is back?
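
The threshold arithmetic quoted above can be checked with the counts Richard gave. A minimal sketch, assuming plain 64-bit integers in place of GCC's profile_count and ignoring the bb_loop_depth condition; the function names `sinks` and `sinks_patched` are hypothetical:

```cpp
#include <cstdint>

// Simplified stand-in for the count test in select_best_block.
// Returns true when BEST_BB would be chosen, i.e. the statement is sunk.
constexpr bool sinks(std::uint64_t best_count, std::uint64_t early_count,
                     std::uint64_t threshold)
{
    return !(best_count * 100 >= early_count * threshold);
}

// Same test with the ">=" relaxed to ">" as in the quoted diff.
constexpr bool sinks_patched(std::uint64_t best_count, std::uint64_t early_count,
                             std::uint64_t threshold)
{
    return !(best_count * 100 > early_count * threshold);
}

// Before: 72544947 * 100 == 96726596 * 75 == 7254494700, an exact match.
static_assert(!sinks(72544947, 96726596, 75), "exact match defeats sinking");
// After: 177683003 * 100 = 17768300300 < 236910671 * 75 = 17768300325.
static_assert(sinks(177683003, 236910671, 75), "rounded-down count allows sinking");
// With ">" the old counts sink as well.
static_assert(sinks_patched(72544947, 96726596, 75), "patched test sinks on a tie");
```

The difference between the two states is only 25 units on a product of about 1.8e10, which illustrates how sensitive the decision is to rounding in the block counts.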

[Bug tree-optimization/105740] missed optimization switch transformation for conditions with duplicate conditions

2022-06-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740

--- Comment #10 from luoxhu at gcc dot gnu.org ---
(In reply to Martin Liška from comment #9)
> (In reply to luoxhu from comment #8)
> > (In reply to rguent...@suse.de from comment #6)
> > > On Tue, 21 Jun 2022, jakub at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740
> > > > 
> > > > --- Comment #5 from Jakub Jelinek  ---
> > > > The problem with switch-conversion done multiple times is that when it 
> > > > is done
> > > > early, it can do worse job than when it is done late, e.g. we can have 
> > > > better
> > > > range information later which allows (unfortunately switch-conversion 
> > > > doesn't
> > > > use that yet, there is a PR about it) to ignore some never reachable 
> > > > values
> > > > etc.
> > > > So ideally we either need to be able to undo switch-conversion and redo 
> > > > it if
> > > > things have changed, or do it only late and for e.g. inlining costs 
> > > > perform it
> > > > only in analysis mode and record somewhere what kind of lowering would 
> > > > be done
> > > > and how much it would cost.
> > > > With multiple if-to-switch, don't we risk that we turn some ifs into 
> > > > switch,
> > > > then
> > > > switch-conversion lowers it back to ifs and then another if-to-switch 
> > > > matches
> > > > it again and again lowers it?
> > > 
> > > Yeah, I think ideally switch conversion would be done as part of switch
> > > lowering (plus maybe an extra if-to-switch).  The issue might be what
> > > I said - some passes don't like switches, but they probably need to be
> > > taught.  As of inline cost yes, doing likely-switch-converted analysis
> > > would probably work.
> > 
> > git diff
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index b257307e085..1376e7cb28d 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -243,8 +243,6 @@ along with GCC; see the file COPYING3.  If not see
> >  Clean them up.  Failure to do so well can lead to false
> >  positives from warnings for erroneous code.  */
> >NEXT_PASS (pass_copy_prop);
> >/* Identify paths that should never be executed in a conforming
> >  program and isolate those paths.  */
> >NEXT_PASS (pass_isolate_erroneous_paths);
> > @@ -329,6 +327,7 @@ along with GCC; see the file COPYING3.  If not see
> >POP_INSERT_PASSES ()
> >NEXT_PASS (pass_simduid_cleanup);
> >NEXT_PASS (pass_lower_vector_ssa);
> > +  NEXT_PASS (pass_if_to_switch);
> >NEXT_PASS (pass_lower_switch);
> >NEXT_PASS (pass_cse_reciprocals);
> >NEXT_PASS (pass_reassoc, false /* early_p */);
> > 
> > Tried this to add the second if_to_switch before lower_switch, but switch
> > lowering doesn't work same as switch_conversion:
> 
> Note the lowering expand to a decision tree where node of such tree can be
> jump-tables,
> bit-tests or simple comparisons.
> 
> > 
> > ;; Function test2 (test2, funcdef_no=0, decl_uid=1982, cgraph_uid=1,
> > symbol_order=0)
> > 
> > beginning to process the following SWITCH statement ((null):0) : ---
> > switch (_2)  [INV], case 1:  [INV], case 2:  [INV],
> > case 3:  [INV], case 4:  > 3> [INV], case 5:  [INV], case 6:  [INV]>
> > 
> > ;; GIMPLE switch case clusters: JT(values:6 comparisons:6 range:6 density:
> > 100.00%):1-6
> 
> So jump-table is selected. Where do you see this GIMPLE representation?

This is dumped by the second run of iftoswitch after fre5.

> 
> ...
> 
> > 
> > ASM still contains indirect jump table like -fno-switch-conversion:
> 
> > 
> > Is this bug of lower_switch or expected?
> 
> What bug do you mean? 

Sorry, it is not a bug. I have learned that switch lowering and switch
conversion do two different things, which differs from the statement
"pass_lower_switch also performs the transforms switch-conversion does" in c#4?

> 
> > From the code, they have different
> > purpose as switch_conversion turns switch to single if-else while
> 
> No switch_conversion expands a switch statement to a series of assignment
> based on CSWITCH[index] arrays.
> 
> > lower_switch expand CLUSTERS as a decision tree.

Yes, rerunning pass_convert_switch after the second if_to_switch can generate
the CSWITCH[index] lookup.

pr105740.c.195t.switchconv2:

   [local count: 1073741824]:
  if (x_4(D) > 3)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870913]:
  _1 = f_6(D)->arr[3];
  _10 = (unsigned int) _1;
  _2 = _10 + 4294967295;
  if (_2 <= 5)
goto ; [INV]
  else
goto ; [INV]

   [local count: 1073741822]:
:
  _8 = 0;
  goto ; [100.00%]

   [local count: 1073741822]:
:
  _9 = CSWTCH.4[_2];

   [local count: 2147483644]:
  # _3 = PHI <_8(4), 0(2), _9(5)>
:
:
  return _3;
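
At the source level, the switchconv2 dump above corresponds roughly to a bounds check followed by a table load. A hedged sketch: the names `cswtch_table` and `lookup` are hypothetical, the six values are borrowed from the PHI node in the test2 dump of comment #8, and which value belongs to which case is an assumption.

```cpp
// Hypothetical stand-in for the compiler-generated CSWTCH.4 array.
static const int cswtch_table[6] = {12, 27, 38, 18, 58, 68};

int lookup(int x)
{
    // _2 = _10 + 4294967295: subtract 1 in unsigned arithmetic so that
    // values below 1 wrap around to large numbers.
    unsigned idx = static_cast<unsigned>(x) - 1u;
    if (idx <= 5)                  // if (_2 <= 5)
        return cswtch_table[idx];  // _9 = CSWTCH.4[_2]
    return 0;                      // default path: _8 = 0
}
```

A single unsigned comparison covers both ends of the 1..6 range, so no jump table or comparison chain is needed.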

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-06-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #8 from luoxhu at gcc dot gnu.org ---
init-regs:

(insn 13 8 17 2 (set (reg:V4SI 141)
(vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 135 [ R2 ])
(reg/v:V4SI 133 [ R0 ]))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
]))) "q.C":22:45 1785 {altivec_vmrglw_direct_v4si}
 (expr_list:REG_DEAD (reg/v:V4SI 135 [ R2 ])
(expr_list:REG_DEAD (reg/v:V4SI 133 [ R0 ])
(nil))))
(insn 17 13 21 2 (set (reg:V4SI 146)
(vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 136 [ R3 ])
(reg/v:V4SI 134 [ R1 ]))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
]))) "q.C":23:45 1785 {altivec_vmrglw_direct_v4si}
 (expr_list:REG_DEAD (reg/v:V4SI 136 [ R3 ])
(expr_list:REG_DEAD (reg/v:V4SI 134 [ R1 ])
(nil))))
(insn 21 17 24 2 (set (reg:V4SI 150)
(vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
(reg:V4SI 141))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
]))) "q.C":26:6 1785 {altivec_vmrglw_direct_v4si}
 (expr_list:REG_DEAD (reg:V4SI 146)
(expr_list:REG_DEAD (reg:V4SI 141)
(nil))))
(insn 24 21 25 2 (parallel [
(set (reg:SI 151)
(vec_select:SI (reg:V4SI 150)
(parallel [
(const_int 3 [0x3])
])))
(clobber (scratch:V4SI))
]) "q.C":28:10 1400 {*vsx_extract_si}
 (nil))
(insn 25 24 26 2 (set (reg:DI 152)
(zero_extend:DI (reg:SI 151))) "q.C":28:10 16 {zero_extendsidi2}
 (expr_list:REG_DEAD (reg:SI 151)
(nil)))
(insn 26 25 27 2 (parallel [
(set (reg:SI 153)
(vec_select:SI (reg:V4SI 150)
(parallel [
(const_int 2 [0x2])
])))
(clobber (scratch:V4SI))
]) "q.C":28:10 1400 {*vsx_extract_si}
 (nil))
(insn 27 26 28 2 (set (reg:DI 154)
(zero_extend:DI (reg:SI 153))) "q.C":28:10 16 {zero_extendsidi2}
 (expr_list:REG_DEAD (reg:SI 153)
(nil)))
(insn 28 27 29 2 (parallel [
(set (reg:SI 155)
(vec_select:SI (reg:V4SI 150)
(parallel [
(const_int 1 [0x1])
])))
(clobber (scratch:V4SI))
]) "q.C":28:10 1400 {*vsx_extract_si}
 (nil))
(insn 29 28 30 2 (set (reg:DI 156)
(zero_extend:DI (reg:SI 155))) "q.C":28:10 16 {zero_extendsidi2}
 (expr_list:REG_DEAD (reg:SI 155)
(nil)))
(insn 30 29 31 2 (parallel [
(set (reg:SI 157)
(vec_select:SI (reg:V4SI 150)
(parallel [
(const_int 0 [0])
])))
(clobber (scratch:V4SI))
]) "q.C":28:10 1400 {*vsx_extract_si}
 (expr_list:REG_DEAD (reg:V4SI 150)
(nil)))


combine:

Trying 13 -> 28:
   13: r141:V4SI=vec_select(vec_concat(r164:V4SI,r162:V4SI),parallel)
  REG_DEAD r164:V4SI
   28: {r155:SI=vec_select(r141:V4SI,parallel);clobber scratch;}
  REG_DEAD r141:V4SI
Successfully matched this instruction:
(parallel [
(set (reg:SI 155)
(vec_select:SI (reg:V4SI 164)
(parallel [
(const_int 3 [0x3])
])))
(clobber (scratch:V4SI))
])
allowing combination of insns 13 and 28
original costs 4 + 8 = 12
replacement cost 8
deferring deletion of insn with uid = 13.
modifying insn i3    28: {r155:SI=vec_select(r164:V4SI,parallel);clobber
scratch;}
  REG_DEAD r164:V4SI
deferring rescan insn with uid = 28.



(note 7 47 8 2 NOTE_INSN_DELETED)
(note 8 7 13 2 NOTE_INSN_FUNCTION_BEG)
(note 13 8 17 2 NOTE_INSN_DELETED)
(note 17 13 21 2 NOTE_INSN_DELETED)
(note 21 17 24 2 NOTE_INSN_DELETED)
(insn 24 21 25 2 (parallel [
(set (reg:SI 151)
(vec_select:SI (reg:V4SI 162)
(parallel [
(const_int 3 [0x3])
])))
(clobber (scratch:V4SI))
]) "q.C":28:10 1400 {*vsx_extract_si}
 (expr_list:REG_DEAD (reg:V4SI 162)
(nil)))
(note 25 24 26 2 NOTE_INSN_DELETED)
(insn 26 25 27 2 (parallel [
(set (reg:SI 153)
(vec_select:SI (reg:V4SI 163)
(parallel [

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-06-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #5 from luoxhu at gcc dot gnu.org ---
It seems combine wrongly merged two vec_select instructions:

Trying 188 -> 199:
  188: r343:V4SI=vec_select(vec_concat(r168:V4SI,r338:V4SI),parallel)
  REG_DEAD r338:V4SI
  REG_DEAD r168:V4SI
  199: {r353:SI=vec_select(r343:V4SI,parallel);clobber scratch;}
Failed to match this instruction:
(parallel [
(set (reg:SI 353)
(vec_select:SI (reg:V4SI 338)
(parallel [
(const_int 3 [0x3])
])))
(clobber (scratch:V4SI))
(set (reg:V4SI 343)
(vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 168 [ R02$m_simd ])
(reg:V4SI 338))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
])))
])
Failed to match this instruction:
(parallel [
(set (reg:SI 353)
(vec_select:SI (reg:V4SI 338)
(parallel [
(const_int 3 [0x3])
])))
(set (reg:V4SI 343)
(vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 168 [ R02$m_simd ])
(reg:V4SI 338))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
])))
])
Successfully matched this instruction:
(set (reg:V4SI 343)
(vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 168 [ R02$m_simd ])
(reg:V4SI 338))
(parallel [
(const_int 2 [0x2])
(const_int 6 [0x6])
(const_int 3 [0x3])
(const_int 7 [0x7])
])))
Successfully matched this instruction:
(set (reg:SI 353)
(vec_select:SI (reg:V4SI 338)
(parallel [
(const_int 3 [0x3])
])))
allowing combination of insns 188 and 199
original costs 4 + 8 = 12
replacement costs 4 + 8 = 12
modifying insn i2   188:
r343:V4SI=vec_select(vec_concat(r168:V4SI,r338:V4SI),parallel)
  REG_DEAD r168:V4SI
deferring rescan insn with uid = 188.
modifying insn i3   199: {r353:SI=vec_select(r338:V4SI,parallel);clobber
scratch;}
  REG_DEAD r338:V4SI
deferring rescan insn with uid = 199.

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-06-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Reduced to:

#include <stdio.h>
extern "C" void *memcpy(void *, const void *, unsigned long);
typedef __attribute__((altivec(vector__))) unsigned native_simd_type;

union {
native_simd_type V;
int R[4];
} store_le_vec;

struct S {
S() = default;
S(unsigned B0) {
native_simd_type val{B0};
m_simd = val;
}
void store_le(unsigned char out[]) {
store_le_vec.V = m_simd;
unsigned int x0 = store_le_vec.R[0];
memcpy(out, &x0, 1);
}
static void transpose(S &B0, S B1, S B2, S B3) {
native_simd_type T0 = __builtin_vec_mergeh(B0.m_simd,
B2.m_simd);
native_simd_type T1 = __builtin_vec_mergeh(B1.m_simd,
B3.m_simd);
native_simd_type T2 = __builtin_vec_mergel(B0.m_simd,
B2.m_simd);
native_simd_type T3 = __builtin_vec_mergel(B1.m_simd,
B3.m_simd);
B0 = __builtin_vec_mergeh(T0, T1);
B3 = __builtin_vec_mergel(T2, T3);
printf ("B0: %x, %x,%x,%x\n", B0.m_simd[0], B0.m_simd[1],
B0.m_simd[2], B0.m_simd[3]);
}
S(native_simd_type x) : m_simd(x) {}
native_simd_type m_simd;
};

void
foo (unsigned char output[], unsigned state[], native_simd_type R0,
native_simd_type R1, native_simd_type R2, native_simd_type R3)
{
S R00; R00.m_simd = R0;
S R01; R01.m_simd = R1;
S R02; R02.m_simd = R2;
S R03; R03.m_simd = R3;
S::transpose(R00, R01, R02, R03);
R00.store_le(output);
}

unsigned char res[1];
unsigned main_state[]{1634760805, 60878,  2036477234, 6,
0,  825562964,  1471091955, 1346092787,
506976774,  4197066702, 518848283,  118491664,
0,  0,  0,  0};
int
main ()
{
native_simd_type R0 = native_simd_type {0x41fcef98, 0,0,0};
native_simd_type R1 =  native_simd_type {0x91648e8b, 0,0,0};
native_simd_type R2 = native_simd_type  {0x7dca18c6, 0,0,0};
native_simd_type R3 = native_simd_type  {0x61707865, 0,0,0};
foo (res, main_state, R0, R1, R2, R3);
if (res[0] != 152)
__builtin_abort();
}
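
For reference, the value that `transpose` should leave in B0 can be worked out with a scalar model of the merges. A sketch under the assumption that `__builtin_vec_mergeh(a, b)` yields {a[0], b[0], a[1], b[1]} and `__builtin_vec_mergel(a, b)` yields {a[2], b[2], a[3], b[3]} in source element order on either endianness; the helper names are hypothetical:

```cpp
#include <array>

using V4 = std::array<unsigned, 4>;

// Scalar models of the word merges, in source element order.
constexpr V4 mergeh(const V4 &a, const V4 &b) { return V4{a[0], b[0], a[1], b[1]}; }
constexpr V4 mergel(const V4 &a, const V4 &b) { return V4{a[2], b[2], a[3], b[3]}; }

// New B0 as computed by S::transpose: mergeh(T0, T1) with
// T0 = mergeh(B0, B2) and T1 = mergeh(B1, B3).
constexpr V4 transposed_b0(const V4 &b0, const V4 &b1, const V4 &b2, const V4 &b3)
{
    return mergeh(mergeh(b0, b2), mergeh(b1, b3));
}

// Inputs from main () in the reduced test.
constexpr V4 r0{0x41fcef98u, 0, 0, 0};
constexpr V4 r1{0x91648e8bu, 0, 0, 0};
constexpr V4 r2{0x7dca18c6u, 0, 0, 0};
constexpr V4 r3{0x61707865u, 0, 0, 0};

static_assert(transposed_b0(r0, r1, r2, r3)[0] == 0x41fcef98u, "B0[0] is R0[0]");
static_assert(transposed_b0(r0, r1, r2, r3)[3] == 0x61707865u, "B0[3] is R3[0]");
```

Under this model B0 becomes {R0[0], R1[0], R2[0], R3[0]}, so the printf should show the four non-zero input words and `store_le` should see 0x41fcef98 in element 0.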

[Bug tree-optimization/106126] [12 Regression] tree check fail in useless_type_conversion_p, at gimple-expr.cc:87 since r13-1184-g57424087e82db140

2022-06-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106126

--- Comment #13 from luoxhu at gcc dot gnu.org ---
Otherwise we need to record first_bb when conditions_in_bbs->is_empty (), then
check it in is_beneficial and ordered_remove the info entry if the bb recorded
as first_bb is no longer the first "if condition" in the chain and contains a
statement with side effects. The fix would look like the patch below, but I am
not sure whether it is worth doing this to handle both PR105740 and PR106126?


git diff
diff --git a/gcc/gimple-if-to-switch.cc b/gcc/gimple-if-to-switch.cc
index f7b0b02628b..44bb0228856 100644
--- a/gcc/gimple-if-to-switch.cc
+++ b/gcc/gimple-if-to-switch.cc
@@ -63,7 +63,7 @@ struct condition_info

   condition_info (gcond *cond): m_cond (cond), m_bb (gimple_bb (cond)),
 m_forwarder_bb (NULL), m_ranges (), m_true_edge (NULL), m_false_edge
(NULL),
-m_true_edge_phi_mapping (), m_false_edge_phi_mapping ()
+m_true_edge_phi_mapping (), m_false_edge_phi_mapping (), first_bb(false)
   {
 m_ranges.create (0);
   }
@@ -80,6 +80,7 @@ struct condition_info
   edge m_false_edge;
   mapping_vec m_true_edge_phi_mapping;
   mapping_vec m_false_edge_phi_mapping;
+  bool first_bb;
 };

 /* Recond PHI mapping for an original edge E and save these into vector VEC. 
*/
@@ -194,6 +195,16 @@ if_chain::is_beneficial ()
   auto_vec<cluster *> clusters;
   clusters.create (m_entries.length ());

+  for (unsigned i = 0; i < m_entries.length (); i++)
+{
+  condition_info *info = m_entries[i];
+  if (info->first_bb && i != 0 && !no_side_effect_bb (info->m_bb))
+   {
+ m_entries.ordered_remove (i);
+ break;
+   }
+}
+
   for (unsigned i = 0; i < m_entries.length (); i++)
 {
   condition_info *info = m_entries[i];
@@ -397,6 +408,8 @@ find_conditions (basic_block bb,
   tree_code code = gimple_cond_code (cond);

   condition_info *info = new condition_info (cond);
+  if (conditions_in_bbs->is_empty ())
+info->first_bb = true;

[Bug tree-optimization/106126] [12 Regression] tree check fail in useless_type_conversion_p, at gimple-expr.cc:87 since r13-1184-g57424087e82db140

2022-06-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106126

--- Comment #12 from luoxhu at gcc dot gnu.org ---
conditions_in_bbs->is_empty () doesn't mean that the range is at the start of
the switch condition :(, so we can't simply ignore the no_side_effect_bb check?

[Bug tree-optimization/106126] [12 Regression] tree check fail in useless_type_conversion_p, at gimple-expr.cc:87 since r13-1184-g57424087e82db140

2022-06-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106126

--- Comment #11 from luoxhu at gcc dot gnu.org ---
Sorry for the breakage; my bugzilla account is luo...@gcc.gnu.org.

The patch seems reasonable in folding 65-90 ('A'-'Z') into the switch statement:

4,6c4,6
< ;; Canonical GIMPLE case clusters: 33 60 62 126
< ;; BT can be built: BT(values:3 comparisons:6 range:30 density: 20.00%):33-62
126
< pr106126.c:3:28: optimized: Condition chain with 4 BBs transformed into a
switch statement.
---
> ;; Canonical GIMPLE case clusters: 33 60 62 65-90 126
> ;; BT can be built: BT(values:3 comparisons:6 range:30 density: 20.00%):33-62 
> 65-90 126
> pr106126.c:3:28: optimized: Condition chain with 5 BBs transformed into a 
> switch statement.

...

96,97c108,109
<:
<   switch (_13)  [INV], case 33:  [INV], case 60: 
[INV], case 62:  [INV], case 126:  [INV]>
---
>:
>   switch (_13)  [INV], case 33:  [INV], case 60:  
> [INV], case 62:  [INV], case 65 ... 90:  [INV], case 126:  
> [INV]>




complete pr106126.bad.c.046t.iftoswitch:

;; Function pool_conda_matchspec (pool_conda_matchspec, funcdef_no=0,
decl_uid=1979, cgraph_uid=1, symbol_order=1)

;; Canonical GIMPLE case clusters: 33 60 62 65-90 126
;; BT can be built: BT(values:3 comparisons:6 range:30 density: 20.00%):33-62
65-90 126

pr106126.c:3:28: optimized: Condition chain with 5 BBs transformed into a
switch statement.
Removing basic block 9
;; basic block 9, loop depth 2
;;  pred:
if (_13 != 62)
  goto ; [INV]
else
  goto ; [INV]
;;  succ:   10
;;  12


Removing basic block 10
;; basic block 10, loop depth 2
;;  pred:
if (_13 != 33)
  goto ; [INV]
else
  goto ; [INV]
;;  succ:   11
;;  12


Removing basic block 11
;; basic block 11, loop depth 2
;;  pred:
if (_13 != 126)
  goto ; [INV]
else
  goto ; [INV]
;;  succ:   3
;;  12


Removing basic block 3
;; basic block 3, loop depth 2
;;  pred:
_3 = (unsigned char) _13;
_4 = _3 + 191;
if (_4 <= 25)
  goto ; [INV]
else
  goto ; [INV]
;;  succ:   14
;;  13


Expanded into a new gimple STMT: switch (_13)  [INV], case 33:
 [INV], case 60:  [INV], case 62:  [INV], case 65 ... 90: 
[INV], case 126:  [INV]>

Removing basic block 13
;; basic block 13, loop depth 2
;;  pred:
:
goto ; [100.00%]
;;  succ:   6


Removing basic block 14
;; basic block 14, loop depth 1
;;  pred:
:
;;  succ:   4


fix_loop_structure: fixing up loops for function
void pool_conda_matchspec ()
{
  unsigned char _8;
  char _10;
  char * var_1.3_11;
  char _13;
  unsigned char _14;
  char * var_1.3_15;

   :
  goto ; [INV]

   :
  # _14 = PHI <_3(7)>
  # var_1.3_15 = PHI 
:
  _8 = _14 + 65;
  _10 = (char) _8;
  *var_1.3_15 = _10;

   :

   :
:
  var_1.3_11 = var_1;
  if (var_1.3_11 != 0B)
goto ; [INV]
  else
goto ; [INV]

   :
  _13 = *var_1.3_11;
  if (_13 != 0)
goto ; [INV]
  else
goto ; [INV]

   :
  switch (_13)  [INV], case 33:  [INV], case 60: 
[INV], case 62:  [INV], case 65 ... 90:  [INV], case 126: 
[INV]>

   :
:
  return;

}


The problem is that _3 is removed together with basic block 3, but _14 still
uses it.
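
The `case 65 ... 90` cluster comes from the unsigned range check in basic block 3 (`_4 = _3 + 191; if (_4 <= 25)`). A small sketch showing why that check is the same as testing for 'A'..'Z'; the function names are hypothetical:

```cpp
// The test from basic block 3. Adding 191 modulo 256 is the same as
// subtracting 65 ('A'), so the result is <= 25 exactly for 65..90.
constexpr bool in_upper_range(char c)
{
    return static_cast<unsigned char>(static_cast<unsigned char>(c) + 191) <= 25;
}

// Exhaustively compare the wrap-around test with the plain range test.
constexpr bool matches_case_range()
{
    for (int v = 0; v < 256; ++v) {
        bool by_wraparound = static_cast<unsigned char>(v + 191) <= 25;
        bool by_range = v >= 65 && v <= 90;
        if (by_wraparound != by_range)
            return false;
    }
    return true;
}

static_assert(matches_case_range(), "(unsigned char)(c + 191) <= 25 means 65 <= c <= 90");
```

The check itself can therefore be represented as a case range, but the `_3` definition it relies on lives in the block being removed, which is where the dangling use in the PHI for `_14` comes from.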

[Bug tree-optimization/105903] Missed optimization for __synth3way

2022-06-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105903

--- Comment #2 from luoxhu at gcc dot gnu.org ---
diff --git a/gcc/match.pd b/gcc/match.pd
index 4a570894b2e..f6b5415a351 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5718,6 +5718,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (bit_xor (convert (rshift @0 {shifter;})) @1)
  (bit_not (bit_xor (convert (rshift @0 {shifter;})) @1)))

+/* X >= Y ? X > Y : 0 into X > Y. */
+(simplify
+  (cond (ge @0 @1) (gt @0 @1) integer_zerop)
+   (if (INTEGRAL_TYPE_P (type)
+   && POINTER_TYPE_P (TREE_TYPE (@0))
+   && POINTER_TYPE_P (TREE_TYPE (@1)))
+(gt @0 @1)))
+
+/* X < Y ? 0 : X > Y into X > Y.  */
+(simplify
+  (cond (lt @0 @1) integer_zerop (gt @0 @1))
+   (if (INTEGRAL_TYPE_P (type)
+   && POINTER_TYPE_P (TREE_TYPE (@0))
+   && POINTER_TYPE_P (TREE_TYPE (@1)))
+(gt @0 @1)))
+

These two patterns let phiopt4 fold the PHI for the two greater3way cases and
generate the expected results.
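
The algebra behind both patterns can be checked exhaustively on a small range. A sketch that uses plain ints for convenience, whereas the match.pd patterns above are restricted to pointer operands; the function names are hypothetical:

```cpp
// Left-hand sides of the two patterns written as ordinary C++.
constexpr bool pattern1(int x, int y) { return x >= y ? x > y : false; }  // X >= Y ? X > Y : 0
constexpr bool pattern2(int x, int y) { return x < y ? false : x > y; }   // X < Y ? 0 : X > Y

// Compare both against the simplified form X > Y for all pairs in [-3, 3].
constexpr bool all_equal_to_gt()
{
    for (int x = -3; x <= 3; ++x)
        for (int y = -3; y <= 3; ++y)
            if (pattern1(x, y) != (x > y) || pattern2(x, y) != (x > y))
                return false;
    return true;
}

static_assert(all_equal_to_gt(), "both conditionals reduce to x > y");
```

In both cases the guard is implied by the result: x > y already forces x >= y, and when the guard fails the expression is 0, which is also the value of x > y.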

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-06-23 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Could you also paste the ASM difference, please? (I don't have an environment
at hand so far.)

[Bug tree-optimization/105740] missed optimization switch transformation for conditions with duplicate conditions

2022-06-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740

--- Comment #8 from luoxhu at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #6)
> On Tue, 21 Jun 2022, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740
> > 
> > --- Comment #5 from Jakub Jelinek  ---
> > The problem with switch-conversion done multiple times is that when it is 
> > done
> > early, it can do worse job than when it is done late, e.g. we can have 
> > better
> > range information later which allows (unfortunately switch-conversion 
> > doesn't
> > use that yet, there is a PR about it) to ignore some never reachable values
> > etc.
> > So ideally we either need to be able to undo switch-conversion and redo it 
> > if
> > things have changed, or do it only late and for e.g. inlining costs perform 
> > it
> > only in analysis mode and record somewhere what kind of lowering would be 
> > done
> > and how much it would cost.
> > With multiple if-to-switch, don't we risk that we turn some ifs into switch,
> > then
> > switch-conversion lowers it back to ifs and then another if-to-switch 
> > matches
> > it again and again lowers it?
> 
> Yeah, I think ideally switch conversion would be done as part of switch
> lowering (plus maybe an extra if-to-switch).  The issue might be what
> I said - some passes don't like switches, but they probably need to be
> taught.  As of inline cost yes, doing likely-switch-converted analysis
> would probably work.

git diff
diff --git a/gcc/passes.def b/gcc/passes.def
index b257307e085..1376e7cb28d 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -243,8 +243,6 @@ along with GCC; see the file COPYING3.  If not see
 Clean them up.  Failure to do so well can lead to false
 positives from warnings for erroneous code.  */
   NEXT_PASS (pass_copy_prop);
   /* Identify paths that should never be executed in a conforming
 program and isolate those paths.  */
   NEXT_PASS (pass_isolate_erroneous_paths);
@@ -329,6 +327,7 @@ along with GCC; see the file COPYING3.  If not see
   POP_INSERT_PASSES ()
   NEXT_PASS (pass_simduid_cleanup);
   NEXT_PASS (pass_lower_vector_ssa);
+  NEXT_PASS (pass_if_to_switch);
   NEXT_PASS (pass_lower_switch);
   NEXT_PASS (pass_cse_reciprocals);
   NEXT_PASS (pass_reassoc, false /* early_p */);

I tried this to add a second if_to_switch before lower_switch, but switch
lowering doesn't work the same way as switch_conversion:

;; Function test2 (test2, funcdef_no=0, decl_uid=1982, cgraph_uid=1,
symbol_order=0)

beginning to process the following SWITCH statement ((null):0) : ---
switch (_2)  [INV], case 1:  [INV], case 2:  [INV],
case 3:  [INV], case 4:  [INV], case 5:  [INV], case 6:  [INV]>

;; GIMPLE switch case clusters: JT(values:6 comparisons:6 range:6 density:
100.00%):1-6
Removing basic block 11
;; basic block 11, loop depth 0
;;  pred:
switch (_2)  [INV], case 1:  [INV], case 2:  [INV],
case 3:  [INV], case 4:  [INV], case 5:  [INV], case 6:  [INV]>
;;  succ:   4
;;  5
;;  6
;;  7
;;  8
;;  9
;;  10



Updating SSA:
Registering new PHI nodes in block #0
Registering new PHI nodes in block #2
Updating SSA information for statement _1 = f_10(D)->len;
Registering new PHI nodes in block #3
Updating SSA information for statement _2 = f_10(D)->arr[3];
...
int test2 (struct fs * f)
{
  int _1;
  int _2;
  int _8;

   [local count: 1073741824]:
  _1 = f_10(D)->len;
  if (_1 > 3)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870913]:
  _2 = f_10(D)->arr[3];
  switch (_2)  [0.00%], case 1:  [16.67%], case 2: 
[16.67%], case 3:  [16.67%], case 4:  [16.67%], case 5: 
[16.67%], case 6:  [16.67%]>

   [local count: 67108864]:
:
  goto ; [100.00%]

   [local count: 62914560]:
:
  goto ; [100.00%]

   [local count: 58982400]:
:
  goto ; [100.00%]

   [local count: 55296000]:
:
  goto ; [100.00%]

   [local count: 5184]:
:
  goto ; [100.00%]

   [local count: 4860]:
:

   [local count: 1073741824]: 
 # _8 = PHI <12(4), 27(5), 38(6), 18(7), 58(8), 68(9), 0(3), 0(2)>
:
  return _8;

}

The ASM still contains an indirect jump table, as with -fno-switch-conversion:

test2:
.LFB0:
.cfi_startproc
xorl%eax, %eax
cmpl$3, (%rdi)
jle .L1
cmpl$6, 16(%rdi)
ja  .L3
movl16(%rdi), %eax
jmp *.L5(,%rax,8)
.section.rodata
.align 8
.align 4
.L5:
.quad   .L3
.quad   .L11
.quad   .L9
.quad   .L8
.quad   .L7
.quad   .L6
.quad   .L4
.text
.p2align 4,,10
.p2align 3
.L11:
movl$12, %eax
.L1:
ret
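
For readers without the testcase at hand, here is a hypothetical C
reconstruction of test2, based only on the dump and asm above.  The struct
layout, the names test2_ifs/test2_table and every case-to-value pairing except
case 1 -> 12 are assumptions.  The second function sketches the kind of table
lookup that switch conversion would be expected to produce instead of a jump
table.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: the asm reads len at offset 0 and arr[3] at offset 16,
   which fits an int followed by an int array.  */
struct fs
{
  int len;
  int arr[4];
};

/* If-chain form, the shape if-to-switch is meant to recognize.  Only
   "1 -> 12" is confirmed by the asm (.L11); the other pairings are
   guessed from the PHI argument order.  */
int
test2_ifs (const struct fs *f)
{
  assert (f != NULL);
  if (f->len > 3)
    {
      int v = f->arr[3];
      if (v == 1)
        return 12;
      else if (v == 2)
        return 27;
      else if (v == 3)
        return 38;
      else if (v == 4)
        return 18;
      else if (v == 5)
        return 58;
      else if (v == 6)
        return 68;
    }
  return 0;
}

/* Table form: a range check plus a load from a constant array, with no
   indirect jump.  */
int
test2_table (const struct fs *f)
{
  static const int values[6] = { 12, 27, 38, 18, 58, 68 };
  assert (f != NULL);
  if (f->len > 3)
    {
      /* Unsigned subtraction folds "v >= 1 && v <= 6" into one compare.  */
      unsigned int idx = (unsigned int) f->arr[3] - 1u;
      if (idx < 6u)
        return values[idx];
    }
  return 0;
}
```

Both functions return the same value for any input, so the difference is only
in the code the compiler has to generate.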

[Bug tree-optimization/105740] missed optimization switch transformation for conditions with duplicate conditions

2022-06-20 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Running if_to_switch and convert_switch again after copyprop2 could remove the
redundant statement and expose the opportunity for if-to-switch again. Is this
reasonable, or should if-to-switch/switch-conversion just be moved later and run
only once?


diff --git a/gcc/gimple-if-to-switch.cc b/gcc/gimple-if-to-switch.cc
index f7b0b02628b..8f55d0e2f75 100644
--- a/gcc/gimple-if-to-switch.cc
+++ b/gcc/gimple-if-to-switch.cc
@@ -484,6 +484,8 @@ public:
|| bit_test_cluster::is_enabled ());
   }

+  opt_pass *clone () { return new pass_if_to_switch (m_ctxt); }
+
   virtual unsigned int execute (function *);

 }; // class pass_if_to_switch
diff --git a/gcc/passes.def b/gcc/passes.def
index 375d3d62d51..b257307e085 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -243,6 +243,8 @@ along with GCC; see the file COPYING3.  If not see
 Clean them up.  Failure to do so well can lead to false
 positives from warnings for erroneous code.  */
   NEXT_PASS (pass_copy_prop);
+  NEXT_PASS (pass_if_to_switch);
+  NEXT_PASS (pass_convert_switch);
   /* Identify paths that should never be executed in a conforming
 program and isolate those paths.  */
   NEXT_PASS (pass_isolate_erroneous_paths);
diff --git a/gcc/tree-switch-conversion.cc b/gcc/tree-switch-conversion.cc
index 50a17927f39..d5c8262785e 100644
--- a/gcc/tree-switch-conversion.cc
+++ b/gcc/tree-switch-conversion.cc
@@ -2429,6 +2429,9 @@ public:

   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_switch_conversion != 0; }
+
+  opt_pass *clone () { return new pass_convert_switch (m_ctxt); }
+
   virtual unsigned int execute (function *);

[Bug ipa/100034] missed optimization for dead code elimination at -O3 (vs. -O1, -Os, -O2)

2022-06-08 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100034

--- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> Looks related to PR1 - we do an IPA SRA clone but fail to inline it and
> thus we end up with
> 
> void d.isra ()
> {
>   int D.1980;
>   int g.2_1;
> 
>[local count: 10631108]:
> 
>[local count: 96646437]:
>   g.2_1 = 0;
>   if (g.2_1 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 1073741824]:
>   foo ();
>   goto ; [100.00%]
> 
> }
> 
> int main ()
> {
>   int a.0_2;
>   int b.1_3;
> 
>[local count: 59461674]:
>   goto ; [100.00%]
> 
>[local count: 1014686025]:
>   a.0_2 = a;
>   if (a.0_2 == 0)
> goto ; [99.96%]
>   else
> goto ; [0.04%]
> 
>[local count: 1014280151]:
>   // predicted unlikely by continue predictor.
>   goto ; [100.00%]
> 
>[local count: 405874]:
>   d.isra ();
> 
>[local count: 1073741824]:
>   b.1_3 = b;
>   if (b.1_3 != 0)
> goto ; [94.50%]
>   else
> goto ; [5.50%]
> 
>[local count: 59055800]:
>   return 0;
> 
> }
> 
> where we optimize main to 'return 0' but fail to elide the unused d.isra.
> 
> So also a dup of the cases where a late IPA function reclaim is missing.

early_inliner inlines e into main at -O3 because param_early_inlining_insns is 14
for -O3 but 6 for -O2, so want_early_inline_function_p returns a different
result.

Then ipa-inline fails to inline d.isra via inline_functions_called_once, as it
has two callers, e->d.isra and main->d.isra.

But the two d.isra calls are removed by the GIMPLE 102t.ccp2 pass after all IPA
passes:


pr100034.O3.c.103t.objsz1:

;; Function d.isra (d.isra.0, funcdef_no=4, decl_uid=2014, cgraph_uid=7,
symbol_order=10) (executed once)

void d.isra ()
{
  int D.2016;

   [local count: 10631108]:

   [local count: 1073741824]:
  foo ();
  goto ; [100.00%]

}



;; Function e (e, funcdef_no=2, decl_uid=1994, cgraph_uid=3, symbol_order=6)

void e ()
{
   [local count: 59461674]:
  return;

}



;; Function main (main, funcdef_no=3, decl_uid=1999, cgraph_uid=4,
symbol_order=7) (executed once)

int main ()
{
   [local count: 59461674]:
  return 0;

} 

Currently all IPA passes are run before the GIMPLE optimizations. Is it possible to
run some passes like pass_rebuild_cgraph_edges and pass_ipa_remove_symbols again
after some GIMPLE optimizations expose new opportunities?

[Bug ipa/93318] [10 regression] Firefox LTO+FDO ICEs in speculative_call_info

2022-05-13 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93318

--- Comment #10 from luoxhu at gcc dot gnu.org ---
And the Profile id of that node is streamed to many objects after lto
partition:

grep --  "19598949" **
db_server.ltrans0.000i.cgraph:  Profile id: 19598949
db_server.ltrans0.000i.cgraph:  Profile id: 19598949
db_server.ltrans0.000i.cgraph:  Profile id: 19598949
db_server.ltrans0.079i.inline:  Profile id: 19598949
db_server.ltrans0.079i.inline:  Profile id: 19598949
db_server.ltrans12.000i.cgraph:  Profile id: 19598949
db_server.ltrans12.000i.cgraph:  Profile id: 19598949
db_server.ltrans12.000i.cgraph:  Profile id: 19598949
db_server.ltrans14.000i.cgraph:  Profile id: 19598949
db_server.ltrans26.000i.cgraph:  Profile id: 19598949
db_server.ltrans26.000i.cgraph:  Profile id: 19598949
db_server.ltrans26.000i.cgraph:  Profile id: 19598949
db_server.ltrans31.000i.cgraph:  Profile id: 19598949
db_server.ltrans32.000i.cgraph:  Profile id: 19598949
db_server.wpa.000i.cgraph:  Profile id: 19598949
db_server.wpa.001i.lto-link:  Profile id: 19598949
db_server.wpa.003i.lto-partition:  Profile id: 19598949
db_server.wpa.070i.whole-program:  Profile id: 19598949
db_server.wpa.071i.profile_estimate:  Profile id: 19598949
db_server.wpa.072i.icf:  Profile id: 19598949
db_server.wpa.073i.devirt:  Profile id: 19598949
db_server.wpa.074i.cp:  Profile id: 19598949
db_server.wpa.075i.sra:  Profile id: 19598949
db_server.wpa.078i.fnsummary:  Profile id: 19598949
db_server.wpa.079i.inline:  Profile id: 19598949
db_server.wpa.080i.pure-const:  Profile id: 19598949
db_server.wpa.080i.pure-const:  Profile id: 19598949
db_server.wpa.080i.pure-const:  Profile id: 19598949
db_server.wpa.080i.pure-const:  Profile id: 19598949
db_server.wpa.082i.static-var:  Profile id: 19598949
db_server.wpa.082i.static-var:  Profile id: 19598949

[Bug ipa/93318] [10 regression] Firefox LTO+FDO ICEs in speculative_call_info

2022-05-13 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93318

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
I have a testcase that ICEs at:

external/com_google_protobuf/src/google/protobuf/message_lite.h:515:68:
internal compiler error: Segmentation fault
0xde2816 crash_signal
../../gcc/toplev.c:328
0xe82370 copy_bb
../../gcc/tree-inline.c:2204
0xe84afa copy_cfg_body
../../gcc/tree-inline.c:3022
0xe855ea copy_body
../../gcc/tree-inline.c:3270
0xe8945b expand_call_inline
../../gcc/tree-inline.c:5061
0xe8a055 gimple_expand_calls_inline
../../gcc/tree-inline.c:5251
0xe8a831 optimize_inline_calls(tree_node*)
../../gcc/tree-inline.c:5424
0xb976ea inline_transform(cgraph_node*)
../../gcc/ipa-inline-transform.c:736
0xd1a147 execute_one_ipa_transform_pass
../../gcc/passes.c:2233
0xd1a2a1 execute_all_ipa_transforms(bool)
../../gcc/passes.c:2272
0x901809 cgraph_node::expand()
../../gcc/cgraphunit.c:2293
0x901e4a expand_all_functions
../../gcc/cgraphunit.c:2471
0x9028dd symbol_table::compile()
../../gcc/cgraphunit.c:2822
0x834fbc lto_main()
../../gcc/lto/lto.c:653


tree-inline.c:2204

2204:cgraph_edge *indirect = old_edge->speculative_call_indirect_edge ();
2205:profile_count indir_cnt = indirect->count;

The returned indirect is 0, which causes the failure on line 2205.



(gdb) p old_edge->caller->debug()
_ZNK6google8protobuf11MessageLite23IsInitializedWithErrorsEv/15805768
(IsInitializedWithErrors) @0x76d44438
  Type: function definition analyzed
  Visibility: external public visibility_specified visibility:hidden
  References: _ZNK4trpc15RequestProtocol13IsInitializedEv/15470318 (addr)
(speculative)
  Referring:
  Function IsInitializedWithErrors/15805768 is inline copy in
OnExtendedInfosReceive/3878638
  Availability: local
  Unit id: 1201
  Function flags: count:26415 (adjusted) first_run:577 body local hot
  Called by:
_ZN7yottadb2ds18BoundedReadWatcher22OnExtendedInfosReceiveERKSs/3878638
(inlined) (26415 (adjusted),1.00
per call) (can throw external)
  Calls:
_ZNK6google8protobuf11MessageLite29LogInitializationErrorMessageEv/15806151 (0
(guessed),0.00 per call) (can
throw external)
_ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16350633
(speculative) (inl
ined) (12547 (adjusted),0.47 per call) (can throw external)
_ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializ
edEv.constprop.0/16375492 (inlined) (indirect_inlining) (13868 (adjusted),0.52
per call) (can throw external)
$84 = void
(gdb) p old_edge->callee->debug()
_ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16350633
(IsInitialized.constprop) @0x7
6d44b40
  Type: function definition analyzed
  Visibility: artificial
  References:
  Referring:
  Read from file: db_server.ltrans32.o
  Function IsInitialized.constprop/16350633 is inline copy in
OnExtendedInfosReceive/3878638
  Availability: local
  Unit id: 116
  Function flags: count:12547 (adjusted) first_run:8235 body local icf_merged
nonfreeing_fn
  Called by:
_ZNK6google8protobuf11MessageLite23IsInitializedWithErrorsEv/15805768
(speculative) (inlined) (12547 (adj
usted),0.47 per call) (can throw external)
  Calls:


In wpa.079i.inline, the node has TWO *polymorphic indirect call* speculative
targets.  I wrote a testcase like it, but that one passed.

_ZNK6google8protobuf11MessageLite23IsInitializedWithErrorsEv/15805768
(IsInitializedWithErrors) @0x7efdc479a2d0
  Type: function definition analyzed
  Visibility: prevailing_def_ironly
  previous sharing asm name: 16375490
  References: _ZNK4trpc15RequestProtocol13IsInitializedEv/15470318 (addr)
(speculative) _ZNK7yottadb3rpc17RunCommandRequest13IsInitializedEv/9954194
(addr) (speculative)
  Referring:
  Read from file:
bazel-out/k8-dbg/bin/external/com_google_protobuf/libprotobuf_lite.a
  Availability: local
  Profile id: 19598949
  Unit id: 1200
  Function flags: count:1072 (adjusted) first_run:577 local
  Called by:
_ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1ESsEEbRKT0_/16456195
(1824663 (estimated locally),0.00 per call) (can throw external)
_ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1EPNS0_2io19ZeroCopyInputStreamEEEbRKT0_/15806727
(14 (adjusted),1.00 per call) (can throw external)
_ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1ESsEEbRKT0_/15806733
(1006 (adjusted),1.00 per call) (can throw external)
_ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1ENS0_11StringPieceEEEbRKT0_/15806735
(52 (precise),1.00 per call) (can throw external)
  Calls:
_ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16365519
(speculative) (inlined) (456 (adjusted),0.43 per call) (

[Bug lto/105133] lto/gold: lto failed to link --start-lib/--end-lib in gold for duplicate libraries

2022-04-05 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133

--- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> (In reply to luoxhu from comment #0)
> > 
> > cat hellow.res
> > 3
> > hello.o 2
> > 192 ccb9165e03755470 PREVAILING_DEF main
> > 197 ccb9165e03755470 PREVAILING_DEF_IRONLY s
> > ./B/libhello.c.o 1
> > 205 68e0b97e93a52d7a PREEMPTED_REG hello
> > ./C/libhello.c.o 1
> > 205 18fe2d3482bfb511 PREEMPTED_REG hello
> 
> This looks like a gold bug - we have 'hello' pre-empted twice but no
> prevailing
> symbol in the IR - are you ending up with fat LTO objects?

These are not fat LTO objects, since I didn't add -ffat-lto-objects when
generating the lib:

nm libhello.a

libhello.c.o:
nm: libhello.c.o: plugin needed to handle lto object
0001 C __gnu_lto_slim


> 
> OTOH PREEMPTED_REG seems then handled wrongly by LTO as well - it should
> throw away both copies since the linker told us it found a preempting
> definition in a non-IR object file.  So I'd expect a unresolved reference
> to 'hello' rather than LTO complaining about multiple definitions ...

Will you fix it? :)

> 
> Note gold is really unmaintained, so you should probably avoid using it.

Thanks. Will try lld instead.

[Bug lto/105133] New: lto/gold: lto failed to link --start-lib/--end-lib in gold

2022-04-01 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133

Bug ID: 105133
   Summary: lto/gold: lto failed to link --start-lib/--end-lib in
gold
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Hi, the gold linker supports --start-lib and --end-lib to "mimics the
semantics of static libraries, but without needing to actually create
the archive file." (https://reviews.llvm.org/D66848).  Sometimes a large
application pulls in multiple libraries from different repositories that contain
the same source code, and they all end up linked into one binary.  Recently I
hit a link error with gold as the linker and reduced it to the example below:

cat hello.c
extern int hello(int a);
int main(void)
{
  return 0; /* hello(10); */
}

cat ./B/libhello.c
#include <stdio.h>
int hello(int a)
{
   puts("Hello");
   return 0;
}

cat ./C/libhello.c
#include <stdio.h>
int hello(int a)
{
   puts("Hello");
   return 0;
}


(1) NON lto link with gold is OK:

gcc -O2 -o ./B/libhello.c.o   -c ./B/libhello.c
gcc-ar qc ./B/libhello.a  ./B/libhello.c.o
gcc-ranlib ./B/libhello.a
gcc -O2 -o ./C/libhello.c.o   -c ./C/libhello.c
gcc-ar qc ./C/libhello.a  ./C/libhello.c.o
gcc-ranlib ./C/libhello.a
gcc hello.c -o hello.o -c -O2
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o  -Wl,--end-lib
-Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -fuse-ld=gold


(2) lto link with gold fails with redefinition:
gcc -O2 -flto  -o ./B/libhello.c.o   -c ./B/libhello.c
gcc-ar qc ./B/libhello.a  ./B/libhello.c.o
gcc-ranlib ./B/libhello.a
gcc -O2 -flto  -o ./C/libhello.c.o   -c ./C/libhello.c
gcc-ar qc ./C/libhello.a  ./C/libhello.c.o
gcc-ranlib ./C/libhello.a
gcc hello.c -o hello.o -c -O2 -flto
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o  -Wl,--end-lib
-Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -flto -fuse-ld=gold


./B/libhello.c:5:5: error: 'hello' has already been defined
5 | int hello(int a)
  | ^
./B/libhello.c:5:5: note: previously defined here
lto1: fatal error: errors during merging of translation units
compilation terminated.
lto-wrapper: fatal error: gcc returned 1 exit status
compilation terminated.
/usr/bin/ld.gold: fatal error: lto-wrapper failed
collect2: error: ld returned 1 exit status

This error is raised in gcc/lto/lto-symtab.c:lto_symtab_resolve_symbols.
Simply removing the error_at line makes the link work, but that may not be a
reasonable fix.

  /* Find the single non-replaceable prevailing symbol and
 diagnose ODR violations.  */
  for (e = first; e; e = e->next_sharing_asm_name)
{
  if (!lto_symtab_resolve_can_prevail_p (e))
continue;

  /* If we have a non-replaceable definition it prevails.  */
  if (!lto_symtab_resolve_replaceable_p (e))
{
  if (prevailing)
{
  error_at (DECL_SOURCE_LOCATION (e->decl),
"%qD has already been defined", e->decl);
  inform (DECL_SOURCE_LOCATION (prevailing->decl),
  "previously defined here");
}
  prevailing = e;
}
}
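
The loop above can be modelled in a few lines of plain C.  This is a toy
sketch, not GCC code: struct toy_sym and toy_resolve are made-up names, and the
two flags merely stand in for lto_symtab_resolve_can_prevail_p and
lto_symtab_resolve_replaceable_p.  It only illustrates that a second entry
counted as a non-replaceable definition triggers the diagnostic.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a symbol table entry; NOT GCC's symtab_node.  */
struct toy_sym
{
  const char *file;
  int can_prevail;  /* models lto_symtab_resolve_can_prevail_p */
  int replaceable;  /* models lto_symtab_resolve_replaceable_p */
  struct toy_sym *next_sharing_asm_name;
};

/* Walks all entries sharing one asm name.  Returns how many "already
   defined" diagnostics the loop would emit and stores the last
   non-replaceable entry in *prevailing_out.  */
int
toy_resolve (struct toy_sym *first, struct toy_sym **prevailing_out)
{
  struct toy_sym *prevailing = NULL;
  int errors = 0;

  assert (prevailing_out != NULL);
  for (struct toy_sym *e = first; e; e = e->next_sharing_asm_name)
    {
      if (!e->can_prevail)
        continue;

      /* A non-replaceable definition prevails; a second one is an error.  */
      if (!e->replaceable)
        {
          if (prevailing)
            errors++;
          prevailing = e;
        }
    }
  *prevailing_out = prevailing;
  return errors;
}
```

With one entry for ./B/libhello.c.o and one for ./C/libhello.c.o, both flagged
as able to prevail and non-replaceable, the model reports one error, which
matches the single "has already been defined" message in the log.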


cat hellow.res
3
hello.o 2
192 ccb9165e03755470 PREVAILING_DEF main
197 ccb9165e03755470 PREVAILING_DEF_IRONLY s
./B/libhello.c.o 1
205 68e0b97e93a52d7a PREEMPTED_REG hello
./C/libhello.c.o 1
205 18fe2d3482bfb511 PREEMPTED_REG hello


Secondly, if hello(10) is actually called in hello.c, NO error is reported.
The difference is that the resolution type changes from PREEMPTED_REG to
RESOLVED_IR/PREVAILING_DEF_IRONLY.

3
hello.o 3
192 19ef867d12f62129 PREVAILING_DEF main
197 19ef867d12f62129 PREVAILING_DEF_IRONLY s
201 19ef867d12f62129 RESOLVED_IR hello
./B/libhello.c.o 1
205 23c5c855935478ce PREVAILING_DEF_IRONLY hello
./C/libhello.c.o 1
205 abbf050f5c23b448 PREEMPTED_REG hello


Is this a valid bug? Thanks.

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2022-01-11 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32

2022-01-11 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from luoxhu at gcc dot gnu.org ---
Fixed by

The master branch has been updated by Xiong Hu Luo :

https://gcc.gnu.org/g:0552605b7b27dc6beed62e71bd05bc1efd191c0d

commit r12-6430-g0552605b7b27dc6beed62e71bd05bc1efd191c0d
Author: Xionghu Luo 
Date:   Mon Jan 10 20:05:56 2022 -0600

testsuite: Fix regression on m32 by r12-6087 [PR103820]

r12-6087 will avoid move cold bb out of hot loop, while the original
intent of this testcase is to hoist divides out of loop and CSE them to
only one divide.  So increase the loop count to turn the cold bb to hot
bb again.  Then the 3 divides could be rewritten with same reciptmp.

Tested pass on Power-Linux {32,64}, x86 {64,32} and i686-linux.

gcc/testsuite/ChangeLog:

PR testsuite/103820
* gcc.dg/tree-ssa/recip-3.c: Adjust.

[Bug bootstrap/103820] [12 Regression] i686 failed to bootstrap with ada by r12-6077

2022-01-11 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103820

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
(In reply to CVS Commits from comment #6)
> The master branch has been updated by Xiong Hu Luo :
> 
> https://gcc.gnu.org/g:0552605b7b27dc6beed62e71bd05bc1efd191c0d
> 
> commit r12-6430-g0552605b7b27dc6beed62e71bd05bc1efd191c0d
> Author: Xionghu Luo 
> Date:   Mon Jan 10 20:05:56 2022 -0600
> 
> testsuite: Fix regression on m32 by r12-6087 [PR103820]
> 
> r12-6087 will avoid move cold bb out of hot loop, while the original
> intent of this testcase is to hoist divides out of loop and CSE them to
> only one divide.  So increase the loop count to turn the cold bb to hot
> bb again.  Then the 3 divides could be rewritten with same reciptmp.
> 
> Tested pass on Power-Linux {32,64}, x86 {64,32} and i686-linux.
> 
> gcc/testsuite/ChangeLog:
> 
> PR testsuite/103820
> * gcc.dg/tree-ssa/recip-3.c: Adjust.

Typo. should be PR103802.

[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32

2022-01-06 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #6 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #5)
> So the point is that P is invariant but we do not hoist it because it's
> computed in a (estimated) cold block?  I notice that the condition is
> invariant, too, so
> in principle we could hoist as
> 
>   if (d > 0.01)
> P = ( W < E ) ? (W - E)/d : (E - W)/d;
>   for (i=0; i < 2; i++ )
> if( d > 0.01 )
>   F[i] += P;


Yes. But this loop only iterates twice, so the bbs in the loop are colder than
the preheader.
-funswitch-loops should move the condition out of the loop, but that also needs
a higher loop iteration count:

"/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c:16:14:
note: Not unswitching, loop is not expected to iterate"

> 
> alternatively one might argue that invariant expressions (unconditionally
> computed or in a special way under invariant conditions) should be costed
> differently.
> 
> I think best would be to restore the original intent of the testcase which
> was added with the fix for PRs 23109, 23948 and 24123.  I suppose there
> we saw the invariant hoisted(?) and the loop unrolled so I would suggest
> to either apply the hoisting or the unrolling manually to the testcase.
> (just look at the PRs whether you get a better idea of the origin of the
> testcase).

To restore the original intent of the testcase, increasing the loop count is
better than "either apply the hoisting or unrolling".  Changing it from "2" to at
least "5" turns the cold bb into a hot bb, so the two divides can be
hoisted out in the LIM pass again (verified that the change below passes on both
power-m32 and x86-i686):

(It is much more reasonable than the other two directions, as the loop iteration
count is not key to the test code.)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 641c91e..a1d2d87 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-O1 -fno-trapping-math -funsafe-math-optimizations
-fdump-tree-recip" } */

-double F[2] = { 0.0, 0.0 }, e;
+double F[5] = { 0.0, 0.0 }, e;

 /* In this case the optimization is interesting.  */
 float h ()
@@ -13,7 +13,7 @@ float h ()
d = 2.*e;
E = 1. - d;

-   for( i=0; i < 2; i++ )
+   for( i=0; i < 5; i++ )
if( d > 0.01 )
{
P = ( W < E ) ? (W - E)/d : (E - W)/d;
@@ -23,4 +23,4 @@ float h ()
F[0] += E / d;
 }

-/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
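
As a reminder of what the scan for a single " / " is checking, here is a small
sketch of the rewrite the recip pass performs under
-funsafe-math-optimizations.  The function names are made up for illustration,
and the expression is a simplified version of the three divides in the
testcase, not the testcase itself.

```c
#include <assert.h>

/* Three divisions by the same d, as the source writes them.  */
double
sum_divides (double w, double e, double d)
{
  assert (d != 0.0);
  return (w - e) / d + (e - w) / d + e / d;
}

/* The same sum with one shared reciprocal, so only one division is
   left.  Rounding can differ slightly from the version above, which is
   why the transformation needs -funsafe-math-optimizations.  */
double
sum_recip (double w, double e, double d)
{
  assert (d != 0.0);
  double reciptmp = 1.0 / d;
  return (w - e) * reciptmp + (e - w) * reciptmp + e * reciptmp;
}
```

The transformation only pays off when all the divides end up where one
reciptmp can serve them, which is what hoisting them out of the loop provides.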

[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32

2021-12-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Or restore the previous recip count check by commenting out the if condition, to
avoid the bb in the loop turning cold?

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 641c91e719e..d3c3053486d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -14,7 +14,13 @@ float h ()
E = 1. - d;

for( i=0; i < 2; i++ )
-   if( d > 0.01 )
+   // if( d > 0.01 )
+   /* The if condition will make the following bb cold (profile count
+  less than the loop preheader), while r12-6087 is an
+  optimization that avoids moving COLD invariant expressions out
+  of the loop.  Since this test case checks that the recip expression
+  can be CSEd and eliminated, comment out the condition to keep
+  the test point.  */
{
P = ( W < E ) ? (W - E)/d : (E - W)/d;
F[i] += P;
@@ -23,4 +29,4 @@ float h ()
F[0] += E / d;
 }

-/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */

[Bug tree-optimization/103793] [12 Regression] ICE: in to_reg_br_prob_base, at profile-count.h:277 with -O3 -fno-guess-branch-probability since r12-6086-gcd5ae148c47c6dee

2021-12-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103793

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug rtl-optimization/94790] Failure to use andn in specific pattern in which it is available

2021-12-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94790

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Just noticed they are different cases, scalar vs. vector...

[Bug rtl-optimization/94790] Failure to use andn in specific pattern in which it is available

2021-12-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94790

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
On Power, '(~mask & a) | (b & mask)' is better than 'a ^ ((a ^ b) & mask)' as
the first can be generated as one instruction 'xxsel' as PR90323 shows.
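
The two forms compute the same bitwise select, which a short C check can
confirm.  The function names below are only for illustration.  Since both
expressions work on each bit independently, trying all 4-bit operands is enough
to cover every per-bit combination.

```c
#include <assert.h>
#include <stdint.h>

/* Select form: take b where mask is 1, a where mask is 0.  */
uint64_t
select_andn (uint64_t a, uint64_t b, uint64_t mask)
{
  return (~mask & a) | (b & mask);
}

/* XOR form of the same select.  */
uint64_t
select_xor (uint64_t a, uint64_t b, uint64_t mask)
{
  return a ^ ((a ^ b) & mask);
}

/* Asserts that both forms agree for every 4-bit a, b and mask.  */
void
check_select_forms (void)
{
  for (uint64_t a = 0; a < 16; a++)
    for (uint64_t b = 0; b < 16; b++)
      for (uint64_t mask = 0; mask < 16; mask++)
        assert (select_andn (a, b, mask) == select_xor (a, b, mask));
}
```

So the choice between them is purely about which one the target can match to a
single instruction.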

[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32

2021-12-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #2 from luoxhu at gcc dot gnu.org ---
-funroll-loops could work around this, is this reasonable?

[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32

2021-12-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #1 from luoxhu at gcc dot gnu.org ---
MOVE_MAX_PIECES is 4 on m32 but 8 on m64, so estimate_move_cost differs
between them (2 vs. 1) for "((size + MOVE_MAX_PIECES - 1) /
MOVE_MAX_PIECES)".
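
The arithmetic is a plain round-up division.  A minimal sketch, assuming the
moved object is an 8-byte double (the F[i] accesses in the dumps below); the
function name is made up and this is not the actual estimate_move_cost code.

```c
#include <assert.h>

/* Number of pieces needed to move SIZE bytes when at most
   MOVE_MAX_PIECES bytes can be moved at once (round-up division).  */
unsigned int
pieces_needed (unsigned int size, unsigned int move_max_pieces)
{
  assert (move_max_pieces > 0);
  return (size + move_max_pieces - 1) / move_max_pieces;
}
```

For an 8-byte double this gives pieces_needed (8, 4) == 2 on m32 and
pieces_needed (8, 8) == 1 on m64, matching the "size: 2" vs. "size: 1" entries
for the F[i_23] load and store.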

recip-3.m32.c.172t.cunroll:

 BB: 11, after_exit: 0
 BB: 7, after_exit: 0
  size:   2 _4 = F[i_23];
  size:   1 _5 = _4 + iftmp.1_10;
  size:   2 F[i_23] = _5;
 BB: 5, after_exit: 0
  size:   1 _2 = d_14 +
1.00088817841970012523233890533447265625e-1;
  size:   1 reciptmp_12 = 1.0e+0 / d_14;
  size:   1 iftmp.1_18 = reciptmp_12 * _2;
 BB: 6, after_exit: 0
  size:   1 _3 = -1.00088817841970012523233890533447265625e-1 -
d_14;
  size:   1 reciptmp_25 = 1.0e+0 / d_14;
  size:   1 iftmp.1_17 = reciptmp_25 * _3;
 BB: 4, after_exit: 0
  size:   2 if (e.0_1 <
-5.00444089209850062616169452667236328125e-2)
size: 19-4, last_iteration: 19-4
  Loop size: 19
  Estimated size after unrolling: 20
Not unrolling loop 1: size would grow.


But recip-3.m64.c.172t.cunroll:

 BB: 11, after_exit: 0
 BB: 7, after_exit: 0
  size:   1 _4 = F[i_23];
  size:   1 _5 = _4 + iftmp.1_10;
  size:   1 F[i_23] = _5;
 BB: 5, after_exit: 0
  size:   1 _2 = d_14 +
1.00088817841970012523233890533447265625e-1;
  size:   1 reciptmp_12 = 1.0e+0 / d_14;
  size:   1 iftmp.1_18 = reciptmp_12 * _2;
 BB: 6, after_exit: 0
  size:   1 _3 = -1.00088817841970012523233890533447265625e-1 -
d_14;
  size:   1 reciptmp_25 = 1.0e+0 / d_14;
  size:   1 iftmp.1_17 = reciptmp_25 * _3;
 BB: 4, after_exit: 0
  size:   2 if (e.0_1 <
-5.00444089209850062616169452667236328125e-2)
size: 17-4, last_iteration: 17-4
  Loop size: 17
  Estimated size after unrolling: 17
Making edge 18->9 impossible by redistributing probability to other edges.
Making edge 8->10 impossible by redistributing probability to other edges.
/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c:16:14:
optimized: loop with 1 iterations completely unrolled (header execution count
357878154)

[Bug middle-end/103802] New: [12 regression] recip-3.c fails after r12-6087 on Power m32

2021-12-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

Bug ID: 103802
   Summary: [12 regression] recip-3.c fails after  r12-6087 on
Power m32
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

Invoking the compiler as /home/luoxhu/workspace/gcc-master_build/gcc/xgcc
-B/home/luoxhu/workspace/gcc-master_build/gcc/
/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c 
-fdiagnostics-plain-output   -O1 -fno-trapping-math -funsafe-math-optimizations
-fdump-tree-recip -S  -m32  -o recip-3.s
Executing on host: /home/luoxhu/workspace/gcc-master_build/gcc/xgcc
-B/home/luoxhu/workspace/gcc-master_build/gcc/
/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c 
-fdiagnostics-plain-output   -O1 -fno-trapping-math -funsafe-math-optimizations
-fdump-tree-recip -S  -m32  -o recip-3.s(timeout = 300)
gcc.dg/tree-ssa/recip-3.c: pattern found 3 times
FAIL: gcc.dg/tree-ssa/recip-3.c scan-tree-dump-times recip " / " 5


The reason is that m32 fails to cunroll due to
 recip-3.m32.c.172t.cunroll:   Not unrolling loop 1: size would grow.

[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526

2021-12-21 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug tree-optimization/103793] [12 Regression] ICE: in to_reg_br_prob_base, at profile-count.h:277 with -O3 -fno-guess-branch-probability since r12-6086-gcd5ae148c47c6dee

2021-12-21 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103793

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---

Confirmed. With -fno-guess-branch-probability the edge probability may be
uninitialized, while to_reg_br_prob_base requires it to be initialized, so add a
guard like this?


+   if (true_edge->probability.initialized_p ())
+ {
+   edge exit_to_latch1 = single_pred_edge (loop1->latch);
+   exit_to_latch1->probability
+ = exit_to_latch1->probability.apply_scale (
+   true_edge->probability.to_reg_br_prob_base (),
+   REG_BR_PROB_BASE);
+   single_exit (loop1)->probability
+ = exit_to_latch1->probability.invert ();
+ }
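
The effect of the guard can be shown with a toy model.  Everything here is
hypothetical (struct toy_prob, toy_to_base, toy_update); it is not GCC's
profile_probability class, it only mimics the "assert when uninitialized"
behaviour and the scale/invert arithmetic of the patch.

```c
#include <assert.h>

/* Same value as GCC's REG_BR_PROB_BASE.  */
#define TOY_PROB_BASE 10000

/* Toy stand-in for an edge probability; NOT GCC's profile_probability.  */
struct toy_prob
{
  int initialized;
  int value; /* in units of 1/TOY_PROB_BASE */
};

/* Models to_reg_br_prob_base: only valid on an initialized value.  */
int
toy_to_base (struct toy_prob p)
{
  assert (p.initialized);
  return p.value;
}

/* Scales *latch by true_prob and sets *exit_prob to the complement, but
   only when true_prob is initialized.  Returns 1 if an update happened,
   0 if the guard skipped it.  */
int
toy_update (struct toy_prob true_prob, struct toy_prob *latch,
            struct toy_prob *exit_prob)
{
  assert (latch != 0 && exit_prob != 0);
  if (!true_prob.initialized)
    return 0; /* the guard: leave both probabilities untouched */

  assert (latch->initialized);
  latch->value = latch->value * toy_to_base (true_prob) / TOY_PROB_BASE;
  exit_prob->initialized = 1;
  exit_prob->value = TOY_PROB_BASE - latch->value;
  return 1;
}
```

Without the early return, the uninitialized case would reach toy_to_base and
hit its assert, which is the toy equivalent of the ICE in the summary.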

[Bug middle-end/102860] [12 regression] libgomp.fortran/simd2.f90 ICEs after r12-4526

2021-12-14 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102860

--- Comment #6 from luoxhu at gcc dot gnu.org ---
Fortran's modulo is floor_mod as documented here:
https://gcc.gnu.org/onlinedocs/gfortran/MODULO.html?

Syntax:
RESULT = MODULO(A, P)

Return value:
The type and kind of the result are those of the arguments. (As a GNU
extension, kind is the largest kind of the actual arguments.)

If A and P are of type INTEGER:
MODULO(A,P) has the value R such that A=Q*P+R, where Q is an integer and R is
between 0 (inclusive) and P (exclusive).

If A and P are of type REAL:
MODULO(A,P) has the value of A - FLOOR (A / P) * P.

The returned value has the same sign as P and a magnitude less than the
magnitude of P.


program test_modulo
  print *, modulo(17,3)
  print *, modulo(17.5,5.5)

  print *, modulo(-17,3)
  print *, modulo(-17.5,5.5)

  print *, modulo(17,-3)
  print *, modulo(17.5,-5.5)
end program


LD_LIBRARY_PATH=./x86_64-pc-linux-gnu/libgfortran/.libs/ ./a.out

   2
   1.
   1
   4.5000
  -1
  -4.5000
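
For comparison with C, whose % operator truncates toward zero, the same
floor-mod semantics can be sketched as follows. The function names
floor_mod_int and floor_mod_real are made up for this illustration, and the
real-valued version assumes the quotient fits in a long long so that it can
avoid the math library.

```c
#include <assert.h>

/* Integer floor-mod: the result is zero or has the sign of P.  */
long
floor_mod_int (long a, long p)
{
  assert (p != 0);
  long r = a % p;               /* C's % truncates toward zero */
  if (r != 0 && ((r < 0) != (p < 0)))
    r += p;                     /* move the remainder onto P's side */
  return r;
}

/* Real floor-mod: a - floor (a / p) * p.  The floor is computed by hand,
   assuming a / p fits in a long long.  */
double
floor_mod_real (double a, double p)
{
  assert (p != 0.0);
  double q = a / p;
  long long f = (long long) q;  /* truncates toward zero */
  if ((double) f > q)
    f -= 1;                     /* step down for negative non-integers */
  return a - (double) f * p;
}
```

Called with the six argument pairs of the Fortran program above, these
functions give 2, 1.0, 1, 4.5, -1 and -4.5, matching the printed values.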

[Bug middle-end/102860] [12 regression] libgomp.fortran/simd2.f90 ICEs after r12-4526

2021-12-14 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102860

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #5 from luoxhu at gcc dot gnu.org ---
P8, P9 and x86 don't vectorize the floor_mod operation, so they pass.
The fix in #c2 only fixes the ICE; execution still fails because R239 is
used but not defined.

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2021-11-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

--- Comment #11 from luoxhu at gcc dot gnu.org ---


+(define_insn_and_split "*anddi3_insn_dot"
+ [(set (pc)
+(if_then_else (eq (and:DI (match_operand:DI 1 "gpc_reg_operand" "%r,r")
+ (match_operand:DI 2 "const_int_operand" "n,n"))
+ (const_int 0))
+ (label_ref (match_operand 3 ""))
+ (pc)))
+  (clobber (match_scratch:DI 0 "=r,r"))]
+  "rs6000_is_valid_2insn_and (operands[2], DImode)
+   && !(rs6000_is_valid_and_mask (operands[2], DImode)
+   || logical_const_operand (operands[2], DImode))"
+  "#"
+  "&& reload_completed"
+  [(pc)]
+{
+   int nb, ne;
+   if (rs6000_is_valid_mask (operands[2], &nb, &ne, DImode) && nb >= ne)
+ {
+   unsigned HOST_WIDE_INT val = INTVAL (operands[2]);
+   int shift = 63 - nb;
+   rtx tmp = gen_rtx_ASHIFT (DImode, operands[1], GEN_INT (shift));
+   tmp = gen_rtx_AND (DImode, tmp, GEN_INT (val << shift));
+   rtx cr0 = gen_rtx_REG (CCmode, CR0_REGNO);
+   rs6000_emit_dot_insn (operands[0], tmp, 1, cr0);
+   rtx loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[3]);
+   rtx cond = gen_rtx_EQ (CCEQmode, cr0, const0_rtx);
+   rtx ite = gen_rtx_IF_THEN_ELSE (VOIDmode, cond, loc_ref, pc_rtx);
+   emit_jump_insn (gen_rtx_SET (pc_rtx, ite));
+   DONE;
+ }
+   else
+ FAIL;
+}
+  [(set_attr "type" "shift")
+   (set_attr "dot" "yes")
+   (set_attr "length" "8,12")])
+


This pattern could combine the two instructions from

 9: {r123:CC=cmp(r124:DI&0x6,0);clobber scratch;}
   REG_DEAD r124:DI
 10: pc={(r123:CC==0)?L15:pc}
  REG_DEAD r123:CC

to:

 10: {pc={(r124:DI&0x6==0)?L15:pc};clobber scratch;}

split2 then splits it into a single rotate dot instruction. Is this OK?


(insn 32 9 33 2 (parallel [
(set (reg:CC 100 0)
(compare:CC (and:DI (ashift:DI (reg:DI 3 3 [124])
(const_int 29 [0x1d]))
(const_int -4611686018427387904 [0xc000000000000000]))
(const_int 0 [0])))
(clobber (reg:DI 3 3 [125]))
]) "pr102239.c":4:6 239 {*rotldi3_mask_dot}
 (nil))
(jump_insn 33 32 11 2 (set (pc)
(if_then_else (eq:CCEQ (reg:CC 100 0)
(const_int 0 [0]))
(label_ref 15)
(pc))) "pr102239.c":4:6 869 {*cbranch}
 (int_list:REG_BR_PROB 536870916 (nil))
 -> 15)

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2021-11-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> (In reply to luoxhu from comment #6)
> > > > foo:
> > > > .LFB0:
> > > > .cfi_startproc
> > > > rldicr. 3,3,29,1
> > > > beq 0,.L2
> > > 
> > > This is fine, but only because it tests the EQ bit (not the LT or GT 
> > > bits).
> > > So the generated RTL for this insn (the 2insn one) is not correct.
> > 
> > The generated RTL in pr102239.c.300r.split2 is:
> > 
> > (insn 32 8 33 2 (parallel [
> > (set (reg:CC 100 0 [123])
> > (compare:CC (and:DI (ashift:DI (reg:DI 3 3 [124])
> > (const_int 29 [0x1d]))
> > (const_int -4611686018427387904
> > [0xc000000000000000]))
> > (const_int 0 [0])))
> > (clobber (reg:DI 3 3 [125]))
> > ]) "pr102239.c":4:6 238 {*rotldi3_mask_dot}
> >  (nil))
> > (insn 33 32 10 2 (set (reg:DI 3 3 [125])
> > (lshiftrt:DI (reg:DI 3 3 [125])
> > (const_int 29 [0x1d]))) "pr102239.c":4:6 278 {lshrdi3}
> >  (nil))
> > (jump_insn 10 33 11 2 (set (pc)
> > (if_then_else (eq (reg:CC 100 0 [123])
> > (const_int 0 [0]))
> > (label_ref 15)
> > (pc))) "pr102239.c":4:6 868 {*cbranch}
> >  (int_list:REG_BR_PROB 536870916 (nil))
> >  -> 15)
> 
> So combine will have to look at insn 10 as well when it does the combination
> (it often already does, via "other_insn") -- but also it does have to know
> an "eq" is okay here, and that requires a new pattern.
> 
> > rotldi3_mask_dot is what you mentioned in c#1, it is a shifted result and
> > not matter for comparing to 0:
> 
> It does matter, if what you are want to see is if it is smaller than zero or
> greater than zero.  CCmode includes those things.  There is a CCEQmode for
> if only the EQ bit is set correctly.

Got it, thanks. As the example in c#7 shows, rotating the data into the highest
bits can give a negative result and set the LT bit in CR0, which is unexpected
for a CCmode comparison.
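
A small sketch of why only the EQ outcome survives the rotate-to-MSB form.
The helper names are invented for this illustration, the mask 0x600000000 is
the one from the insn in c#1, and the "sign" here is just a rough model of
the LT / EQ / GT outcome of a signed compare against zero, not the actual CR0
encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Two contiguous bits, 33 and 34 counting from the LSB.  */
#define FIELD_MASK UINT64_C (0x600000000)

/* The field left where it is.  */
uint64_t
masked_in_place (uint64_t x)
{
  return x & FIELD_MASK;
}

/* The field moved to the two most significant bits (shift left by 29).  */
uint64_t
rotated_to_msb (uint64_t x)
{
  return (x << 29) & (FIELD_MASK << 29);
}

/* The field moved to the two least significant bits (shift right by 33).  */
uint64_t
shifted_to_lsb (uint64_t x)
{
  return (x >> 33) & (FIELD_MASK >> 33);
}

/* -1, 0 or 1 for a 64-bit value read as a signed number vs. zero.  */
int
sign_vs_zero (uint64_t v)
{
  if (v == 0)
    return 0;
  return (v >> 63) ? -1 : 1;
}

/* The shifted masks match the constants seen in the RTL dumps.  */
void
check_field_constants (void)
{
  assert ((FIELD_MASK << 29) == UINT64_C (0xc000000000000000));
  assert ((FIELD_MASK >> 33) == 3);
}
```

For an input with bit 34 set, masked_in_place and shifted_to_lsb compare as
positive while rotated_to_msb compares as negative; all three agree on
whether the result is zero, which is the only property an EQ test needs.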


> 
> > > *rotl<mode>3_mask_dot cannot do this either; the base and the dot2 of that
> > > cannot be done, they return a shifted result, but that doesn't matter for
> > > comparing it to 0.  So we should add a specialised version.
> > 
> > What specialized version to add?
> 
> Some pattern that just does this as an rldicr, as a single insn.  It will
> have to be excluded by the 2insn thing (it is only a single insn itself!),
> and it will have to have comparison mode CCEQ only.


I was motivated by the clang code and tried rotating the data to the LSB
instead. Does that avoid the CCmode issue?  Would this be simpler than the
combine & new pattern solution?

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index c9ce0550df1..d2a5b916b1d 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -11747,11 +11747,11 @@ rs6000_emit_2insn_and (machine_mode mode, rtx *operands, bool expand, int dot)
}
   else
{
- rtx tmp = gen_rtx_ASHIFT (mode, operands[1], GEN_INT (shift));
- tmp = gen_rtx_AND (mode, tmp, GEN_INT (val << shift));
- emit_move_insn (operands[0], tmp);
- tmp = gen_rtx_LSHIFTRT (mode, operands[0], GEN_INT (shift));
+ rtx tmp = gen_rtx_LSHIFTRT (mode, operands[1], GEN_INT (ne));
+ tmp = gen_rtx_AND (mode, tmp, GEN_INT (val >> ne));
  rs6000_emit_dot_insn (operands[0], tmp, dot, dot ? operands[3] : 0);
+ tmp = gen_rtx_ASHIFT (mode, operands[0], GEN_INT (ne));
+ emit_move_insn (operands[0], tmp);
}
   return;


RTL  pr102239.c.300r.split2:

(insn 32 8 33 2 (parallel [
(set (reg:CC 100 0 [123])
(compare:CC (and:DI (lshiftrt:DI (reg:DI 3 3 [124])
(const_int 33 [0x21]))
(const_int 3 [0x3]))
(const_int 0 [0])))
(clobber (reg:DI 3 3 [125]))
]) "pr102239.c":4:6 238 {*rotldi3_mask_dot}
 (nil))
(insn 33 32 10 2 (set (reg:DI 3 3 [125])
(ashift:DI (reg:DI 3 3 [125])
(const_int 33 [0x21]))) "pr102239.c":4:6 268 {ashldi3}
 (nil))
(jump_insn 10 33 11 2 (set (pc)
(if_then_else (eq (reg:CC 100 0 [123])
(const_int 0 [0]))
(label_ref 15)
(pc))) "pr102239.c":4:6 868 {*cbranch}
 (int_list:REG_BR_PROB 536870916 (nil))
 -> 15)


ASM pr102239.s:

foo:
.LFB0:
.cfi_startproc
rldicl. 3,3,31,62
beq 0,.L2
#APP
 # 5 "pr102239.c" 1
# if
 # 0 "" 2
#NO_APP
blr
.p2align 4,,15
.L2:
#APP

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2021-11-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

--- Comment #7 from luoxhu at gcc dot gnu.org ---
 1| Dump of assembler code for function foo:
 2|0x15e0 <+0>: rldicr. r3,r3,29,1
 3+>   0x15e4 <+4>: beq 0x15f0 
 4|0x15e8 <+8>: blr
 5|0x15ec <+12>:ori r2,r2,0
 6|0x15f0 <+16>:blr
 7|0x15f4 <+20>:.long 0x0
 8|0x15f8 <+24>:.long 0x0

(gdb) si
0x15e4 in foo ()
1: /x $r3 = 0xc000
2: /x $cr = 0x82000282

With only rotldi3_mask_dot, cr0 reports negative (LT set); on master the CR
value was 0x42000282.


BTW, clang also generates two rotate instructions:

foo(long):# @foo(long)
rldicl 3, 3, 31, 33
rldicl. 3, 3, 33, 29
beq 0, .LBB0_2
blr
.LBB0_2:
blr
.long   0
.quad   0

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2021-11-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

--- Comment #6 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #5)
> (In reply to luoxhu from comment #4)
> > Simply adjust the sequence of dot instruction could produce expected code,
> > is this correct?
> 
> No it isn't.  Sorry.

Sorry I don't understand what is wrong...

> 
> > foo:
> > .LFB0:
> > .cfi_startproc
> > rldicr. 3,3,29,1
> > beq 0,.L2
> 
> This is fine, but only because it tests the EQ bit (not the LT or GT bits).
> So the generated RTL for this insn (the 2insn one) is not correct.

The generated RTL in pr102239.c.300r.split2 is:

(insn 32 8 33 2 (parallel [
(set (reg:CC 100 0 [123])
(compare:CC (and:DI (ashift:DI (reg:DI 3 3 [124])
(const_int 29 [0x1d]))
(const_int -4611686018427387904 [0xc000000000000000]))
(const_int 0 [0])))
(clobber (reg:DI 3 3 [125]))
]) "pr102239.c":4:6 238 {*rotldi3_mask_dot}
 (nil))
(insn 33 32 10 2 (set (reg:DI 3 3 [125])
(lshiftrt:DI (reg:DI 3 3 [125])
(const_int 29 [0x1d]))) "pr102239.c":4:6 278 {lshrdi3}
 (nil))
(jump_insn 10 33 11 2 (set (pc)
(if_then_else (eq (reg:CC 100 0 [123])
(const_int 0 [0]))
(label_ref 15)
(pc))) "pr102239.c":4:6 868 {*cbranch}
 (int_list:REG_BR_PROB 536870916 (nil))
 -> 15)


rotldi3_mask_dot is what you mentioned in c#1; it returns a shifted result,
which does not matter when comparing to 0:

> *rotl<mode>3_mask_dot cannot do this either; the base and the dot2 of that
> cannot be done, they return a shifted result, but that doesn't matter for
> comparing it to 0.  So we should add a specialised version.

What specialized version to add?

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2021-11-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Simply reordering the dot instruction produces the expected code; is this
correct?


foo:
.LFB0:
.cfi_startproc
rldicr. 3,3,29,1
beq 0,.L2
#APP
 # 10 "pr102239.c" 1
# if
 # 0 "" 2
#NO_APP
blr
.p2align 4,,15
.L2:
#APP
 # 12 "pr102239.c" 1
# else
 # 0 "" 2
#NO_APP
blr



 git diff
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index c9ce0550df1..2f0b5992bbf 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -11749,9 +11749,9 @@ rs6000_emit_2insn_and (machine_mode mode, rtx *operands, bool expand, int dot)
{
  rtx tmp = gen_rtx_ASHIFT (mode, operands[1], GEN_INT (shift));
  tmp = gen_rtx_AND (mode, tmp, GEN_INT (val << shift));
- emit_move_insn (operands[0], tmp);
- tmp = gen_rtx_LSHIFTRT (mode, operands[0], GEN_INT (shift));
  rs6000_emit_dot_insn (operands[0], tmp, dot, dot ? operands[3] : 0);
+ tmp = gen_rtx_LSHIFTRT (mode, operands[0], GEN_INT (shift));
+ emit_move_insn (operands[0], tmp);
}
   return;
 }

[Bug target/102239] powerpc suboptimal boolean test of contiguous bits

2021-11-23 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #1)
> Confirmed.
> 
> So the relevant insn
> 
> (parallel [(set (reg:CC 123)
> (compare:CC (and:DI (reg:DI 124)
> (const_int 25769803776 [0x600000000]))
> (const_int 0 [0])))
>(clobber (scratch:DI))])
> 
> is matched by *and<mode>3_2insn but not by any pattern that ends up as just
> one insn.  Not *and<mode>3_mask_dot, because that doesn't do a shift first,
> is just an AND and there are no machine insns to do that; but there is no
> pattern for what we can do.
> 
> *rotl<mode>3_mask_dot cannot do this either; the base and the dot2 of that
> cannot be done, they return a shifted result, but that doesn't matter for
> comparing it to 0.  So we should add a specialised version.

This seems different from what you describe: in combine, it was combined into
anddi3_2insn_dot:

(insn 9 8 10 2 (parallel [
(set (reg:CC 122)
(compare:CC (and:DI (reg:DI 123)
(const_int 25769803776 [0x600000000]))
(const_int 0 [0])))
(clobber (scratch:DI))
]) "pr102239.c":3:6 210 {*anddi3_2insn_dot}
 (expr_list:REG_DEAD (reg:DI 123)
(nil)))
(jump_insn 10 9 11 2 (set (pc)
(if_then_else (eq (reg:CC 122)
(const_int 0 [0]))
(label_ref 15)
(pc))) "pr102239.c":3:6 868 {*cbranch}
 (expr_list:REG_DEAD (reg:CC 122)
(int_list:REG_BR_PROB 536870916 (nil)))



Then in pr102239.c.302r.split2, it is split by "*and<mode>3_2insn_dot" into
rotldi3_mask+lshrdi3_dot:

Splitting with gen_split_80 (rs6000.md:3721)

(insn 32 8 33 2 (set (reg:DI 3 3 [124])
(and:DI (ashift:DI (reg:DI 3 3 [123])
(const_int 29 [0x1d]))
(const_int -4611686018427387904 [0xc000000000000000])))
"pr102239.c":3:6 236 {*rotldi3_mask}
 (nil))
(insn 33 32 10 2 (parallel [
(set (reg:CC 100 0 [122])
(compare:CC (lshiftrt:DI (reg:DI 3 3 [124])
(const_int 29 [0x1d]))
(const_int 0 [0])))
(clobber (reg:DI 3 3 [124]))
]) "pr102239.c":3:6 281 {*lshrdi3_dot}
 (nil))


Why does this difference happen?

0x600000000 is not a valid mask for anddi3_2insn_dot:


 "(<MODE>mode == Pmode || UINTVAL (operands[2]) <= 0x7fffffff)
   && rs6000_is_valid_2insn_and (operands[2], <MODE>mode)
   && !(rs6000_is_valid_and_mask (operands[2], <MODE>mode)
|| logical_const_operand (operands[2], <MODE>mode))"


(gdb) p UINTVAL (operands[2]) <= 0x7fffffff
$84 = false
(gdb) p rs6000_is_valid_2insn_and (operands[2], E_DImode)
$85 = true
(gdb) p logical_const_operand (operands[2], E_DImode)
$86 = false
(gdb) p rs6000_is_valid_and_mask (operands[2], E_DImode)
$87 = false
(gdb) p Pmode
$88 = DImode

[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526

2021-11-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

--- Comment #5 from luoxhu at gcc dot gnu.org ---
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 5 6 11 7 8 10 9
;;
;; Loop 1
;;  header 8, latch 7
;;  depth 1, outer 0
;;  nodes: 8 7 6 10 5 4 11 3
;;
;; Loop 2
;;  header 6, latch 5
;;  depth 2, outer 1
;;  nodes: 6 5 4 11 3
;;
;; Loop 3
;;  header 4, latch 3
;;  depth 3, outer 2
;;  nodes: 4 3
;; 2 succs { 8 }
;; 3 succs { 4 }
;; 4 succs { 3 5 }
;; 5 succs { 6 }
;; 6 succs { 11 7 }
;; 11 succs { 4 }
;; 7 succs { 8 }
;; 8 succs { 10 9 }
;; 10 succs { 6 }
;; 9 succs { 1 }

The CFG is:

2
|
8<
| \  |
10 9 |
||
67
6<
|| 
11   | 
||
4<-  | 
| \| |
5  3 |
||
--

When iterating loop 3 in predict_extra_loop_exits, the exit edge is 4->5. It
finds edge 3->4 for the statement "if (d_8 == 0)" and marks all e->src->preds
with "predict_paths_leading_to_edge (e1, PRED_LOOP_EXTRA_EXIT, NOT_TAKEN);".

(gdb) pbb 3
;; basic block 3, loop depth 3
;;  pred:   4
_1 = *i_19(D);
_2 = a_4 & c_6;
_3 = _1 + _2;
*i_19(D) = _3;
;;  succ:   4

(gdb) pbb 4
;; basic block 4, loop depth 3
;;  pred:   11
;;  3
# c_6 = PHI 
# d_8 = PHI <0(11), 1(3)>
if (d_8 == 0)
  goto ; [INV]
else
  goto ; [INV]
;;  succ:   3
;;  5 
(gdb) p e->src->preds
$16 = 0x74fba140 = { 3)>}

[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526

2021-11-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Created attachment 51851
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51851&action=edit
Fix incorrect loop exit edge probability

[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526

2021-11-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

--- Comment #3 from luoxhu at gcc dot gnu.org ---
The profile count is correct, but the edge probability is wrong. It turns out
that r12-4526 exposes a long-standing issue in
profile_estimate:predict_extra_loop_exits: when searching for extra exit edges
of the inner loop, it walks out of the loop and finds an edge that belongs to
the *outer loop*, and sets that edge's prediction to 33%; predict_loops then
won't reset that edge for the outer loop.
I drafted a patch to ignore EDGE_DFS_BACK edges when iterating in
predict_extra_loop_exits, and the inner loop becomes hot again.
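
As a rough illustration of "ignore back edges while walking predecessors",
here is a toy model. The toy_edge structure, the TOY_EDGE_DFS_BACK bit and
toy_predict_preds are all hypothetical and much simpler than GCC's CFG data
structures; the block numbers used in the usage note are the ones from the
bb 4 dump in comment #5.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit for the toy model; not GCC's EDGE_DFS_BACK value.  */
#define TOY_EDGE_DFS_BACK 0x1u

struct toy_edge
{
  int src;
  int dest;
  unsigned flags;
  int predicted;                /* set to 1 once a prediction is attached */
};

/* Attach a prediction to every edge entering block DEST, skipping edges
   flagged as DFS back edges.  Returns how many edges were marked.  */
size_t
toy_predict_preds (struct toy_edge *edges, size_t n, int dest)
{
  assert (edges != NULL || n == 0);
  size_t marked = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (edges[i].dest != dest)
        continue;
      if (edges[i].flags & TOY_EDGE_DFS_BACK)
        continue;               /* the drafted change: skip back edges */
      edges[i].predicted = 1;
      marked++;
    }
  return marked;
}
```

With the two predecessors of bb 4 modelled as 11->4 (forward) and 3->4 (back
edge), only 11->4 receives a prediction.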

diff base/pr103270.c.047t.profile_estimate
patched/pr103270.c.047t.profile_estimate  -U15

 Predictions for bb 5
 1 edges in bb 5 predicted to even probabilities
 Predictions for bb 6
-  first match heuristics: 33.00%
-  combined heuristics: 33.00%
+  first match heuristics: 91.67%
+  combined heuristics: 91.67%
   opcode values nonequal (on trees) heuristics of edge 6->11 (ignored): 66.00%
-  extra loop exit heuristics of edge 6->11: 33.00%
+  loop iterations heuristics of edge 6->7: 8.33%
 Predictions for bb 11
 1 edges in bb 11 predicted to even probabilities
 Predictions for bb 7
 1 edges in bb 7 predicted to even probabilities

…

-   [local count: 88915474]:
+   [local count: 6029625]:
   goto ; [100.00%]

-   [local count: 354334800]:
+   [local count: 536870913]:
   _1 = *i_19(D);
   _2 = a_4 & c_6;
   _3 = _1 + _2;
   *i_19(D) = _3;

-   [local count: 708669601]:
+   [local count: 1073741824]:
   # c_6 = PHI 
   # d_8 = PHI <0(11), 1(3)>
   if (d_8 == 0)
 goto ; [50.00%]
   else
 goto ; [50.00%]

-   [local count: 354334800]:
+   [local count: 536870913]:
   # c_21 = PHI 
   b_18 = b_5 + -1;

-   [local count: 1073741824]:
+   [local count: 585656064]:
   # b_5 = PHI <0(10), b_18(5)>
   # c_7 = PHI <0(10), c_21(5)>
   if (b_5 != -11)
-goto ; [33.00%]
+goto ; [91.67%]
   else
-goto ; [67.00%]
+goto ; [8.33%]

-   [local count: 354334800]:
+   [local count: 536870913]:
   goto ; [100.00%]

-   [local count: 719407024]:
+   [local count: 48785151]:
   a_16 = a_4 + 1;

-   [local count: 808322498]:
+   [local count: 54814777]:
   # a_4 = PHI 
   if (a_4 <= 4)
 goto ; [89.00%]
   else
 goto ; [11.00%]

-   [local count: 719407024]:
+   [local count: 48785151]:
   goto ; [100.00%]

-   [local count: 88915474]:
+   [local count: 6029625]:
   return;

[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526

2021-11-16 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

--- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> So you say this is a problem with loop header copying, that would mean the
> issue is really latent and general, no?  Header copying uses
> gimple_duplicate_sese_region and has no own profile updating.  I guess its
> profile updating code isn't designed to cope with copying a region with
> "side"-entries (we are ignoring the backedge here).  Not sure if we can
> somehow generally handle those (maybe we can learn from tracer or threader
> here).
> 
> Honza?

Yes, it seems to be a general issue in gimple_duplicate_sese_region, the inner
loop cfg was:

8
|
3<-- 
| \ |
5  4  

ch_base::copy_headers->gimple_duplicate_sese_region changes it to
(entry edge is 8->3, exit edge is 3->4):

8
|
12
|
4<-- 
|   |
3---
|
5

bb 12 is the block copied from bb 3 as the new preheader, and bb 3 is rotated
to become the new exit of the loop. The counts of bb 3 and bb 12 are adjusted
to "total_count - entry_count" (354334800) and "entry_count" (719407024).
Finally, bb 3 and bb 4 are merged into one block by gimple_merge_blocks during
TODO_cleanup_cfg, with a much smaller count than the preheader.


gimple_duplicate_sese_region:

  if (total_count.initialized_p () && entry_count.initialized_p ())
{
  scale_bbs_frequencies_profile_count (region, n_region,
   total_count - entry_count,
   total_count);
  scale_bbs_frequencies_profile_count (region_copy, n_region, entry_count,
   total_count);
}


Obviously, the profile count of the region containing bb 3 shouldn't be
decreased from "total_count" to "total_count - entry_count": it executes on
every iteration of the loop.  Simply adjusting it back to total_count, with
region_copy at entry_count, causes some other cases to fail. Also, at that
point, is edge 3->4 still not a backedge?
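
The arithmetic behind the two quoted counts can be checked with a small
sketch. The toy_* helpers are invented for this illustration and only mimic
the shape of the two scale_bbs_frequencies_profile_count calls; the
total_count of 1073741824 is an assumption, chosen because it is the value
that makes 354334800 and 719407024 add up.

```c
#include <assert.h>
#include <stdint.h>

/* COUNT * NUM / DEN, rounded to nearest.  Toy stand-in for profile
   scaling; assumes the product fits in 64 bits.  */
uint64_t
toy_scale (uint64_t count, uint64_t num, uint64_t den)
{
  assert (den != 0);
  return (count * num + den / 2) / den;
}

/* Count left on the original region: scaled by (total - entry) / total.  */
uint64_t
toy_region_count (uint64_t count, uint64_t total, uint64_t entry)
{
  assert (entry <= total);
  return toy_scale (count, total - entry, total);
}

/* Count given to the copied region: scaled by entry / total.  */
uint64_t
toy_copy_count (uint64_t count, uint64_t total, uint64_t entry)
{
  assert (entry <= total);
  return toy_scale (count, entry, total);
}
```

Under that assumption, a header block that starts at 1073741824 ends up at
354334800 in the original region and 719407024 in the copy, so the block that
stays inside the loop is left with roughly half the count of the new
preheader.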

[Bug testsuite/103270] New: [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526

2021-11-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

Bug ID: 103270
   Summary: [12 regression] gcc.dg/vect/pr96698.c inner loop
turned from hot to cold after r12-4526
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: testsuite
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

For the testcase gcc.dg/vect/pr96698.c, the inner loop used to be hot
(preheader count < loop count), but it no longer is after r12-4526: bb 3's
profile count 354334801 is only about 1/2 of the preheader bb 5's profile
count 719407024.

I guess it should be fixed in tree-ssa-loop-ch.c in copy_headers, where the
profile counts are updated. This case should be handled specially: when the
single-exit loop has only two bbs and the old header is the new exit->src,
there is no need to scale down the old header's profile count, which preserves
the hotness of the loop.


pr96698.c.138t.lim2:
void test (int a, int * i)
{
  int i__lsm.5;
  int c;
  int b;
  int _22;
  int _23;
  int _24;

   [local count: 88915474]:
  if (a_12(D) <= 4)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 79134772]:
  i__lsm.5_11 = *i_16(D);
  goto ; [100.00%]

   [local count: 116930484]:

   [local count: 354334801]:
  # b_3 = PHI 
  # c_17 = PHI 
  # i__lsm.5_20 = PHI 
  _22 = i__lsm.5_20;
  _23 = a_2 & c_17;
  _24 = _22 + _23;
  i__lsm.5_4 = _24;
  b_15 = b_3 + -1;
  if (b_15 != -11)
goto ; [33.00%]
  else
goto ; [67.00%]

   [local count: 719407024]:
  # i__lsm.5_7 = PHI 
  a_14 = a_2 + 1;
  if (a_14 <= 4)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 640272252]:

   [local count: 719407024]:
  # a_2 = PHI 
  # i__lsm.5_1 = PHI 
  goto ; [100.00%]

   [local count: 79134772]:
  # i__lsm.5_5 = PHI 
  *i_16(D) = i__lsm.5_5;

   [local count: 88915474]:
  return;
}

[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757

2021-11-08 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Fixed and backported to gcc-11.

[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757

2021-11-04 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991

--- Comment #7 from luoxhu at gcc dot gnu.org ---
Fixed, will backport to gcc-11 in a week.

[Bug tree-optimization/103029] [12 regression] gcc.dg/vect/pr82436.c ICEs on r12-4818

2021-11-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103029

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ro at gcc dot gnu.org

--- Comment #8 from luoxhu at gcc dot gnu.org ---
*** Bug 103041 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/103041] [12 regression] gcc.dg/vect/slp-reduc-10a.c etc. FAIL

2021-11-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103041

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #5 from luoxhu at gcc dot gnu.org ---
duplicate and fixed.

*** This bug has been marked as a duplicate of bug 103029 ***

[Bug tree-optimization/103041] [12 regression] gcc.dg/vect/slp-reduc-10a.c etc. FAIL

2021-11-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103041

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Could you please verify whether it is caused by r12-4818 instead of r12-4819?
r12-4819 is an NFC patch, which makes it an unlikely cause, and r12-4818 also
ICEs in PR103029, so this is possibly a duplicate of that.


commit f35af8df241a9eb9c2edf7da26d3c5f53d6e2511
Author: Xionghu Luo 
Date:   Mon Nov 1 00:12:36 2021 -0500

Refactor loop_version

[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757

2021-11-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991

--- Comment #5 from luoxhu at gcc dot gnu.org ---
P9:

.L149:
lxvx %vs32,%r8,%r10
vadduwm %v12,%v12,%v1
mfvsrd %r5,%vs43
mfvsrld %r4,%vs43
vadduwm %v11,%v11,%v9
stxv %vs44,112(%r1)
xxperm %vs32,%vs32,%vs42
vcmpequw %v13,%v0,%v1
vadduwm %v0,%v1,%v0
xxlandc %vs45,%vs33,%vs45  // here.
xxperm %vs32,%vs32,%vs42
xxlor %vs0,%vs0,%vs45
stxvx %vs32,%r8,%r10
stxv %vs0,128(%r1)
addi %r8,%r8,-16
bdnz .L149


$vs43 is not changed by xxlandc

[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757

2021-11-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991

--- Comment #4 from luoxhu at gcc dot gnu.org ---

vect-simd-17.p10.c.335r.final:
3379: %v1:V16QI=unspec[%v1:V16QI,%v1:V16QI,%v9:V16QI] 254
3372: {%v11:V4SI=~%v0:V4SI&%v13:V4SI|%v11:V4SI;clobber %r10:V4SI;}  // wrong code.
  REG_DEAD %v0:V4SI
  REG_UNUSED %r10:V4SI
3373: [%r1:DI+0x80]=%v11:V4SI


ASM:

.L149:
lxvx %vs32,%r9,%r8
vadduwm %v12,%v12,%v13
mfvsrd %r5,%vs42
mfvsrld %r4,%vs42
vadduwm %v10,%v10,%v8
stxv %vs44,112(%r1)
xxperm %vs32,%vs32,%vs41
vadduwm %v1,%v13,%v0
vcmpequw %v0,%v0,%v13
xxperm %vs33,%vs33,%vs41
vandc %r10,%v13,%v0   // wrong code
vor %v11,%r10,%v11// wrong code
stxv %vs43,128(%r1)
stxvx %vs33,%r9,%r8
addi %r8,%r8,-16
bdnz .L149

But the binary is (/opt/binutils-power10/bin/objdump -d vect-simd-17.p10 |
less):

10002ea0:   19 42 09 7c lxvx    vs32,r9,r8
10002ea4:   80 68 8c 11 vadduwm v12,v12,v13
10002ea8:   67 00 45 7d mfvrd   r5,v10
10002eac:   67 02 44 7d mfvsrld r4,vs42
10002eb0:   80 40 4a 11 vadduwm v10,v10,v8
10002eb4:   7d 00 81 f5 stxv    vs44,112(r1)
10002eb8:   d7 48 00 f0 xxperm  vs32,vs32,vs41
10002ebc:   80 00 2d 10 vadduwm v1,v13,v0
10002ec0:   86 68 00 10 vcmpequw v0,v0,v13
10002ec4:   d7 48 21 f0 xxperm  vs33,vs33,vs41
10002ec8:   44 04 4d 11 vandc   v10,v13,v0    // wrong code
10002ecc:   84 5c 6a 11 vor     v11,v10,v11   // wrong code
10002ed0:   8d 00 61 f5 stxv    vs43,128(r1)
10002ed4:   19 43 29 7c stxvx   vs33,r9,r8
10002ed8:   f0 ff 08 39 addi    r8,r8,-16
10002edc:   c4 ff 00 42 bdnz    10002ea0 

%vs42 holds global constant data loaded from memory and is supposed to never
change in the loop, but it is modified at address 0x10002ec8: the r10 in the
ASM has become v10 in the binary.


(gdb)
   0x10002eb4 :  7d 00 81 f5 stxv    vs44,112(r1)
   0x10002eb8 :  d7 48 00 f0 xxperm  vs32,vs32,vs41
   0x10002ebc :  80 00 2d 10 vadduwm v1,v13,v0
   0x10002ec0 : 86 68 00 10 vcmpequw v0,v0,v13
   0x10002ec4 : d7 48 21 f0 xxperm  vs33,vs33,vs41
=> 0x10002ec8 : 44 04 4d 11 vandc   v10,v13,v0
   0x10002ecc : 84 5c 6a 11 vor     v11,v10,v11
   0x10002ed0 : 8d 00 61 f5 stxv    vs43,128(r1)
7: $vs42.v4_int32 = {-30, -29, -28, -27}
(gdb) si
   0x10002eb4 :  7d 00 81 f5 stxv    vs44,112(r1)
   0x10002eb8 :  d7 48 00 f0 xxperm  vs32,vs32,vs41
   0x10002ebc :  80 00 2d 10 vadduwm v1,v13,v0
   0x10002ec0 : 86 68 00 10 vcmpequw v0,v0,v13
   0x10002ec4 : d7 48 21 f0 xxperm  vs33,vs33,vs41
   0x10002ec8 : 44 04 4d 11 vandc   v10,v13,v0
=> 0x10002ecc : 84 5c 6a 11 vor     v11,v10,v11
   0x10002ed0 : 8d 00 61 f5 stxv    vs43,128(r1)
7: $vs42.v4_int32 = {0, 0, 0, 0}
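
To make the effect of the misplaced scratch register concrete, here is a toy
model of the two flagged instructions on four 32-bit lanes. The toy_vandc and
toy_vor helpers are written for this illustration only; the lane values used
for v13 and v0 in the usage note are made up, and only the {-30, -29, -28,
-27} contents of v10 come from the gdb session above.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* dst = a & ~b on each 32-bit lane, the operation vandc performs.
   DST may be the same array as A or B.  */
void
toy_vandc (uint32_t dst[LANES], const uint32_t a[LANES],
           const uint32_t b[LANES])
{
  assert (dst && a && b);
  for (int i = 0; i < LANES; i++)
    dst[i] = a[i] & ~b[i];
}

/* dst = a | b on each 32-bit lane, the operation vor performs.
   DST may be the same array as A or B.  */
void
toy_vor (uint32_t dst[LANES], const uint32_t a[LANES],
         const uint32_t b[LANES])
{
  assert (dst && a && b);
  for (int i = 0; i < LANES; i++)
    dst[i] = a[i] | b[i];
}
```

If the array standing in for v10 initially holds the loop-invariant constant
and the second operand of toy_vandc is all ones in every lane, the call
overwrites the constant with zeros, which is the kind of clobber the
single-step trace shows.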

[Bug tree-optimization/103029] [12 regression] gcc.dg/vect/pr82436.c ICEs on r12-4818

2021-11-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103029

--- Comment #3 from luoxhu at gcc dot gnu.org ---
This hack restores the previous phi order by putting non-DFS-back phi args
before DFS-back edge args.  But I am not sure whether this is the correct
direction.  At least it proves that the phi order matters for later vectorizer
code.


diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index 455c3ef8db9..2ca256c15fa 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimplify-me.h"
 #include "tree-ssa-loop-manip.h"
 #include "dumpfile.h"
+#include "ssa.h"

 static void copy_loops_to (class loop **, int,
   class loop *);
@@ -1577,6 +1578,41 @@ lv_adjust_loop_entry_edge (basic_block first_head, basic_block second_head,
   e1->probability = then_prob;
   e->probability = else_prob;

+  edge le, dfs = NULL, nondfs = NULL;
+  edge_iterator ei;
+
+  if (EDGE_COUNT (e1->dest->preds) > 1)
+  {
+FOR_EACH_EDGE (le, ei, e1->dest->preds)
+  {
+   if (le->flags & EDGE_DFS_BACK)
+ dfs = le;
+   else
+ nondfs = le;
+  }
+if (dfs && nondfs && dfs->dest_idx < nondfs->dest_idx)
+  {
+   gphi_iterator psi;
+   gphi *phi;
+   tree dfsdef, nondfsdef;
+   for (psi = gsi_start_phis (e1->dest); !gsi_end_p (psi); gsi_next (&psi))
+ {
+   phi = psi.phi ();
+   dfsdef = PHI_ARG_DEF (phi, dfs->dest_idx);
+   nondfsdef = PHI_ARG_DEF (phi, nondfs->dest_idx);
+   SET_PHI_ARG_DEF (phi, dfs->dest_idx, nondfsdef);
+   SET_PHI_ARG_DEF (phi, nondfs->dest_idx, dfsdef);
+ }
+
+   EDGE_PRED (e1->dest, dfs->dest_idx) = nondfs;
+   EDGE_PRED (e1->dest, nondfs->dest_idx) = dfs;
+
+   unsigned int temp = nondfs->dest_idx;
+   nondfs->dest_idx = dfs->dest_idx;
+   dfs->dest_idx = temp;
+  }
+  }
+

[Bug tree-optimization/103029] [12 regression] gcc.dg/vect/pr82436.c ICEs on r12-4818

2021-11-01 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103029

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org,
   ||rguenther at suse dot de

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Confirmed.

P7's extra option -mno-allow-movmisalign triggers this ICE.  Adding this
option on P9 also ICEs. The reason is that the phi argument order changes when
the order of loopify and lv_adjust_loop_entry_edge is switched.

The constant input argument from bb 18 is now at phi index 1; does that make
the following vectorizer code fail to handle it?

if (_42 != 0)
  goto ; [80.00%]
else
  goto ; [20.00%]

 [local count: 67276368]:

 [local count: 611603351]:
# i_76 = PHI <i_57(10), 1(18)>  // here
# y_lsm.6_74 = PHI <_61(10), 0.0(18)>  // here
# w_lsm.7_73 = PHI <_58(10), 0.0(18)>  // here
i.0_72 = (unsigned int) i_76;
_70 = (long unsigned int) i.0_72;
_69 = _70 * 80;
x_68 = r_22(D) + _69;
fpred_67 = x_68->f_pred;
fexp_66 = x_68->f_exp;
tem_65 = fpred_67 - fexp_66;
_64 = x_68->f_sigma;
_63 = tem_65 / _64;
_62 = ABS_EXPR <_63>;
_61 = _62 + y_lsm.6_74;
_60 = tem_65 / fexp_66;
_59 = ABS_EXPR <_60>;
_58 = _59 + w_lsm.7_73;
i_57 = i_76 + 1;
if (n_19(D) > i_57)
  goto ; [89.00%]
else
  goto ; [11.00%]

 [local count: 544326983]:
goto ; [100.00%]



It was:


if (_42 != 0)
  goto ; [80.00%]
else
  goto ; [20.00%]

 [local count: 67276368]:

 [local count: 611603351]:
# i_76 = PHI <1(18), i_57(10)>   
# y_lsm.6_74 = PHI <0.0(18), _61(10)>
# w_lsm.7_73 = PHI <0.0(18), _58(10)>
i.0_72 = (unsigned int) i_76;
_70 = (long unsigned int) i.0_72;
_69 = _70 * 80;
x_68 = r_22(D) + _69;
fpred_67 = x_68->f_pred;
fexp_66 = x_68->f_exp;
tem_65 = fpred_67 - fexp_66;
_64 = x_68->f_sigma;
_63 = tem_65 / _64;
_62 = ABS_EXPR <_63>;
_61 = _62 + y_lsm.6_74;
_60 = tem_65 / fexp_66;
_59 = ABS_EXPR <_60>;
_58 = _59 + w_lsm.7_73;
i_57 = i_76 + 1;
if (n_19(D) > i_57)
  goto ; [89.00%]
else
  goto ; [11.00%]

 [local count: 544326983]:
goto ; [100.00%]


The comment in function gimple_lv_adjust_loop_header_phi says

  /* Browse all 'second' basic block phi nodes and add phi args to
 edge 'e' for 'first' head. PHI args are always in correct order.  */ 

Is there any function to fix the phi order?
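
The invariant at stake is that a phi argument's position must match the
position of its incoming edge in the block's predecessor list, so any
reordering has to move both in lock-step. A minimal sketch of that idea
follows; toy_block, toy_phi and toy_swap_preds are hypothetical and far
simpler than GCC's edge and gphi structures.

```c
#include <assert.h>

#define TOY_MAX_PREDS 4

struct toy_block
{
  int n_preds;
  int pred_src[TOY_MAX_PREDS];  /* source block of each incoming edge */
};

struct toy_phi
{
  int arg[TOY_MAX_PREDS];       /* arg[k] arrives over incoming edge k */
};

/* Swap incoming edges I and J of BB and, in lock-step, the matching
   arguments of each of the N_PHIS phi nodes in PHIS.  */
void
toy_swap_preds (struct toy_block *bb, struct toy_phi *phis, int n_phis,
                int i, int j)
{
  assert (i >= 0 && i < bb->n_preds && j >= 0 && j < bb->n_preds);
  int t = bb->pred_src[i];
  bb->pred_src[i] = bb->pred_src[j];
  bb->pred_src[j] = t;
  for (int k = 0; k < n_phis; k++)
    {
      t = phis[k].arg[i];
      phis[k].arg[i] = phis[k].arg[j];
      phis[k].arg[j] = t;
    }
}
```

After the swap every argument is still paired with the same source block as
before; only the index order has changed, which is what the hack in comment
#3 relies on.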

[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757

2021-10-31 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991

--- Comment #3 from luoxhu at gcc dot gnu.org ---
(In reply to Kewen Lin from comment #2)
> (In reply to luoxhu from comment #1)
> > Couldn't reproduce on rain6p1 (P10):
> > 
> 
> It's weird, I can reproduce this on rain6p1.
> 
> FAIL: gcc.dg/vect/vect-simd-17.c execution test
> FAIL: gcc.dg/vect/vect-simd-17.c -flto -ffat-lto-objects execution test
> 
> 		=== gcc Summary ===
> 
> # of expected passes		2
> # of unexpected failures	2
> 
> Probably due to you still specified --with-cpu=power9 instead of
> --with-cpu=power10 in gcc configuration?

Thanks, confirmed. With the patch, --with-cpu=power9 fails on neither P9 nor
P10.

It aborts at line 274 of vect-simd-17.c.

[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757

2021-10-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Couldn't reproduce on rain6p1 (P10):

Test run by luoxhu on Fri Oct 29 04:08:49 2021
Native configuration is powerpc64le-unknown-linux-gnu

=== gcc tests ===

Schedule of variations:
unix

Running target unix
Running /home/luoxhu/workspace/gcc/gcc/testsuite/gcc.dg/vect/vect.exp ...
PASS: gcc.dg/vect/vect-simd-17.c (test for excess errors)
PASS: gcc.dg/vect/vect-simd-17.c execution test
PASS: gcc.dg/vect/vect-simd-17.c -flto -ffat-lto-objects (test for excess
errors)
PASS: gcc.dg/vect/vect-simd-17.c -flto -ffat-lto-objects execution test

=== gcc Summary ===

# of expected passes		4
/home/luoxhu/workspace/build/gcc/xgcc  version 12.0.0 20211029 (experimental)
(GCC)

[Bug target/102868] Missed optimization with __builtin_shuffle and zero vector on ppc

2021-10-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102868

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug target/94613] S/390, powerpc: Wrong code generated for vec_sel builtin

2021-10-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94613

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #17 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug target/102868] Missed optimization with __builtin_shuffle and zero vector on ppc

2021-10-24 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102868

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Patch submitted: 
https://gcc.gnu.org/pipermail/gcc-patches/2021-October/582452.html

[Bug target/102868] New: Missed optimization with __builtin_shuffle and zero vector on ppc

2021-10-21 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102868

Bug ID: 102868
   Summary: Missed optimization with __builtin_shuffle and zero
vector on ppc
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

Similar to PR94680 and PR100165, PPC currently generates inefficient
instructions for the case below:

typedef float V __attribute__((vector_size(16)));
typedef int VI __attribute__((vector_size(16)));
V foo (V x)
{
return __builtin_shuffle (x, (V) { 0, 0, 0, 0 }, (VI) {0, 1, 4, 5});
}


foo:
.LFB0:
.cfi_startproc
.LCF0:
0:  addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
.localentry foo,.-foo
addis %r9,%r2,.LC0@toc@ha
xxspltib %vs32,0
addi %r9,%r9,.LC0@toc@l
lxv %vs33,0(%r9)
xxperm %vs34,%vs32,%vs33
blr



It would be better to produce:

foo:
.LFB0:
.cfi_startproc
vspltisw %v0,0
xxpermdi %vs34,%vs32,%vs34,3
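
For reference, the semantics being requested can be written as a scalar model.
This is only a sketch in portable C (shuffle2_v4 and foo_model are hypothetical
names, and arrays stand in for the vector types); it shows that the result is
simply elements 0 and 1 of x followed by two zeros, i.e. one doubleword of x
next to one doubleword of zero, which is why a single doubleword permute looks
sufficient.

```c
#include <stddef.h>

/* Scalar model of a two-input, four-element __builtin_shuffle: selector
   values 0..3 pick from A, 4..7 pick from B; the selector is taken
   modulo 8.  */
void
shuffle2_v4 (const float a[4], const float b[4], const int sel[4],
             float out[4])
{
  for (size_t i = 0; i < 4; i++)
    {
      int idx = sel[i] & 7;
      out[i] = idx < 4 ? a[idx] : b[idx - 4];
    }
}

/* The case from the report: elements 0 and 1 of X, then two zeros.  */
void
foo_model (const float x[4], float out[4])
{
  static const float zero[4] = { 0, 0, 0, 0 };
  static const int sel[4] = { 0, 1, 4, 5 };
  shuffle2_v4 (x, zero, sel, out);
}
```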

[Bug target/97142] __builtin_fmod not optimized on POWER

2021-09-13 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #22 from luoxhu at gcc dot gnu.org ---
Fixed on master and backported to gcc-11 and gcc-10.

[Bug tree-optimization/102075] fill_always_executed_in_1 incomplete computation

2021-09-13 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102075

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Fixed by Richard’s r12-3313, r12-3429 and r12-3430.

[Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2021-09-06 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Verified that 470.lbm doesn't show a regression on Power8 with -Ofast.

The runtime is 141 sec with r12-897 and 142 sec without that patch.

[Bug rtl-optimization/102008] [12 Regression] no cmov generated for loads next to each other

2021-09-06 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102008

--- Comment #3 from luoxhu at gcc dot gnu.org ---

phiopt4 and sink2 are performing opposite optimizations:

pr102008.c.200t.phiopt4: 

 Hoisting adjacent loads from 3 and 4 into 2:  _6 = foo_4(D)->a;  _5 =
foo_4(D)->b;

pr102008.c.202t.sink2: 

 Sinking _5 = foo_4(D)->b; from bb 2 to bb 4
 Sinking  _6 = foo_4(D)->a; from bb 2 to bb 3
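
The test case is not quoted in this comment, so the following is only an
assumed reconstruction from the dump lines above (struct pair, pick_sunk and
pick_hoisted are hypothetical names). It sketches the two shapes the passes
alternate between: sink2 leaves each load in the arm that uses it, while
phiopt4 hoists the two adjacent loads above the test so that a conditional
select can be used.

```c
struct pair { int a; int b; };

/* Shape that sink2 produces: each load sits in the arm that uses it.  */
int
pick_sunk (int side, const struct pair *foo)
{
  if (side == 1)
    return foo->a;
  return foo->b;
}

/* Shape that phiopt4 produces: both adjacent loads are hoisted above the
   test, leaving a branch-free select.  */
int
pick_hoisted (int side, const struct pair *foo)
{
  int a = foo->a;
  int b = foo->b;
  return side == 1 ? a : b;
}
```

Both functions return the same value; only the placement of the loads differs.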

[Bug rtl-optimization/102008] [12 Regression] no cmov generated for loads next to each other

2021-09-06 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102008

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Confirmed: moving the sink2 pass before phiopt4 restores the previous
instructions for this case:

test:
.LFB0:
.cfi_startproc
cmp w0, 1
ldp w0, w1, [x1]
csel w0, w1, w0, ne
ret
.cfi_endproc



diff --git a/gcc/passes.def b/gcc/passes.def
index 945d2bc797c..83b8310f1ee 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -345,10 +345,10 @@ along with GCC; see the file COPYING3.  If not see
   /* After late CD DCE we rewrite no longer addressed locals into SSA
 form if possible.  */
   NEXT_PASS (pass_forwprop);
+  NEXT_PASS (pass_sink_code);
   NEXT_PASS (pass_phiopt, false /* early_p */);
   NEXT_PASS (pass_fold_builtins);
   NEXT_PASS (pass_optimize_widening_mul);
-  NEXT_PASS (pass_sink_code);
   NEXT_PASS (pass_store_merging);
   NEXT_PASS (pass_tail_calls);


ls *sink*
pr102008.c.139t.sink1  pr102008.c.199t.sink2
ls *phiopt*
pr102008.c.042t.phiopt1  pr102008.c.119t.phiopt2  pr102008.c.131t.phiopt3 
pr102008.c.200t.phiopt4

[Bug target/97142] __builtin_fmod not optimized on POWER

2021-09-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142

--- Comment #15 from luoxhu at gcc dot gnu.org ---
Patch updated:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578740.html

[Bug middle-end/102075] New: fill_always_executed_in_1 incomplete computation

2021-08-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102075

Bug ID: 102075
   Summary: fill_always_executed_in_1 incomplete computation
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

ALWAYS_EXECUTED_IN is not computed completely for nested loops.  The current
design stops if an inner loop doesn't dominate the outer loop's latch or exit
once the inner loop has been left.  That causes an early return from the outer
loop, so the always-executed blocks after the inner loop are skipped.

For example, x->k should be moved out of the outer loop but isn't.

struct X { int i; int j; int k;};

void foo(struct X *x, int n, int l)
{
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
        {
          int *p = &x->j;
          int tem = *p;
          x->j += tem * i;
        }
      int *r = &x->k;
      int tem2 = *r;
      x->k += tem2 * j;
    }
}


Discussion:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577444.html
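
As a sketch of the expected result, here is a hand-written version in which
x->k is kept in a local across the outer loop and stored back once. The name
foo_k_hoisted is hypothetical; this is what one would hope the loop-invariant
and store motion code produces, not the output of any particular GCC revision.

```c
struct X { int i; int j; int k; };

/* Same computation as foo in the report, with the load of x->k hoisted
   out of the outer loop and the store sunk after it.  */
void
foo_k_hoisted (struct X *x, int n, int l)
{
  int k = x->k;
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
        x->j += x->j * i;
      k += k * j;
    }
  x->k = k;
}
```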

[Bug tree-optimization/101250] adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch

2021-07-06 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101250

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Patch posted:

[PATCH] ivopts: Don't adjust IV update statement if both operands use the IV in
COND [PR101250]

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573894.html

[Bug middle-end/101250] New: adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch

2021-06-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101250

Bug ID: 101250
   Summary: adjust_iv_update_pos update the iv statement
unexpectedly cause memory address offset mismatch
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

Test case:

unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
{
  unsigned int len = 2;
  do {
  len++;
  }while(len < maxlen && ip[len] == ref[len]);
  return len;
}


ivopts:

   [local count: 1014686026]:
  _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1];
  ivtmp.16_16 = ivtmp.16_15 + 1;
  _19 = ref_12(D) + 18446744073709551615;
  _6 = MEM[(unsigned char *)_19 + ivtmp.16_16 * 1];
  if (_3 == _6)
goto ; [94.50%]
  else
goto ; [5.50%]

Disabling adjust_iv_update_pos produces:

   [local count: 1014686026]:
  _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1];
  _6 = MEM[(unsigned char *)ref_12(D) + ivtmp.16_15 * 1];
  ivtmp.16_16 = ivtmp.16_15 + 1;
  if (_3 == _6)
goto ; [94.50%]
  else
goto ; [5.50%]


discussions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573709.html
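
In source terms, the second dump corresponds to both loads sharing one
induction variable value, with the increment placed after them; the first dump
instead addresses the second load as (ref - 1) + (iv + 1), which needs the
extra add. A minimal sketch of the shared-IV shape, next to the original test
case, assuming nothing beyond the code in the report (foo_shared_iv is a
hypothetical name):

```c
#include <stddef.h>

/* The test case from the report.  */
unsigned int
foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
{
  unsigned int len = 2;
  do
    {
      len++;
    }
  while (len < maxlen && ip[len] == ref[len]);
  return len;
}

/* Hand-written equivalent: both loads use the same IV value and the
   increment happens after them.  */
unsigned int
foo_shared_iv (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
{
  size_t iv = 3;   /* value of len after the first increment */
  while (iv < maxlen && ip[iv] == ref[iv])
    iv++;
  return (unsigned int) iv;
}
```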

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-21 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #13 from luoxhu at gcc dot gnu.org ---
It is not visible in combine because the constant data is loaded from *.LC0
and the permute is an UNSPEC_VPERM. I will shelve this and switch to other
high-priority issues.

pr100866.c.277r.combine:

(note 4 0 20 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 20 4 2 2 (set (reg:V8HI 126)
(reg:V8HI 66 %v2 [ a ])) "pr100866.c":18:1 1132 {vsx_movv8hi_64bit}
 (expr_list:REG_DEAD (reg:V8HI 66 %v2 [ a ])
(nil)))
(note 2 20 3 2 NOTE_INSN_DELETED)
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 18 2 (set (reg/f:DI 122)
(unspec:DI [
(symbol_ref/u:DI ("*.LC0") [flags 0x82])
(reg:DI 2 %r2)
] UNSPEC_TOCREL)) "pr100866.c":19:13 719 {*tocrefdi}
 (expr_list:REG_EQUAL (symbol_ref/u:DI ("*.LC0") [flags 0x82])
(nil)))
(insn 18 6 9 2 (set (reg:V16QI 123)
(mem/u/c:V16QI (and:DI (reg/f:DI 122)
(const_int -16 [0xfff0])) [0  S16 A128]))
"pr100866.c":19:13 1131 {vsx_movv16qi_64bit}
 (expr_list:REG_DEAD (reg/f:DI 122)
(nil)))
(insn 9 18 10 2 (set (reg:V16QI 124)
(not:V16QI (reg:V16QI 123))) "pr100866.c":19:13 508 {one_cmplv16qi2}
 (expr_list:REG_DEAD (reg:V16QI 123)
(nil)))
(note 10 9 15 2 NOTE_INSN_DELETED)
(insn 15 10 16 2 (set (reg/i:V8HI 66 %v2)
(unspec:V8HI [
(reg:V8HI 126) repeated x2
(reg:V16QI 124)
] UNSPEC_VPERM)) "pr100866.c":20:1 1830 {altivec_vperm_v8hi_direct}
 (expr_list:REG_DEAD (reg:V16QI 124)
(expr_list:REG_DEAD (reg:V8HI 126)
(nil
(insn 16 15 0 2 (use (reg/i:V8HI 66 %v2)) "pr100866.c":20:1 -1
 (nil))

;; Combiner totals: 12 attempts, 12 substitutions (2 requiring new space),

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-20 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #8 from luoxhu at gcc dot gnu.org ---
(In reply to Jens Seifert from comment #7)
> Regarding vec_revb for vector unsigned int. I agree that
> revb:
> .LFB0:
> .cfi_startproc
> vspltish %v1,8
> vspltisw %v0,-16
> vrlh %v2,%v2,%v1
> vrlw %v2,%v2,%v0
> blr
> 
> works. But in this case, I would prefer the vperm approach assuming that the
> loaded constant for the permute vector can be re-used multiple times.
> But please get rid of the xxlnor 32,32,32. That does not make sense after
> loading a constant. Change the constant that need to be loaded.

xxlnor is an LE-specific requirement (it is not emitted when building with
-mbig). We need to turn the index {0,1,2,3} into {31,30,29,28} for vperm;
without it the result is incorrect:

 6|0x1630 <+16>:lvx v0,0,r9
 7+>   0x1634 <+20>:xxlnor  vs32,vs32,vs32
 8|0x1638 <+24>:vperm   v2,v2,v2,v0
 9|0x163c <+28>:blr

(gdb)
0x1634 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xc0d0e0f08090a0b0405060700010203
(gdb) si
0x1638 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc
(gdb) si
0x163c in revb ()
2: /x $vs34.uint128 = 0x78563442785634327856342278563412
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc



Quoted from the ISA:

vperm VRT,VRA,VRB,VRC

vsrc.qword[0] ← VSR[VRA+32]
vsrc.qword[1] ← VSR[VRB+32]
do i = 0 to 15
index ← VSR[VRC+32].byte[i].bit[3:7]
VSR[VRT+32].byte[i] ← src.byte[index]
end

Let the source vector be the concatenation of the
contents of VSR[VRA+32] followed by the contents of
VSR[VRB+32].
For each integer value i from 0 to 15, do the following.
Let index be the value specified by bits 3:7 of byte
element i of VSR[VRC+32].
The contents of byte element index of src are
placed into byte element i of VSR[VRT+32].
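
The quoted description can be modelled in scalar C. This is a sketch that
follows the ISA text as quoted (byte 0 is the leftmost byte of the 32-byte
concatenation); vperm_model and complemented_index are hypothetical helper
names. The second function shows what the xxlnor does from vperm's point of
view: complementing a selector byte and keeping its low five bits maps index i
to 31 - i.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of vperm as quoted above: byte 0 is the leftmost byte of
   the 32-byte concatenation VRA || VRB.  */
void
vperm_model (uint8_t vrt[16], const uint8_t vra[16], const uint8_t vrb[16],
             const uint8_t vrc[16])
{
  uint8_t src[32];
  uint8_t result[16];

  memcpy (src, vra, 16);
  memcpy (src + 16, vrb, 16);
  for (int i = 0; i < 16; i++)
    result[i] = src[vrc[i] & 0x1f];   /* bits 3:7 are the low five bits */
  memcpy (vrt, result, 16);           /* allows vrt to alias an input */
}

/* Effect of complementing a selector byte, as seen by vperm: only the low
   five bits are used, so index i becomes 31 - i.  */
uint8_t
complemented_index (uint8_t i)
{
  return (uint8_t) (~i & 0x1f);
}
```

This matches the gdb session above, where the selector byte 0x0c becomes 0xf3
after the xxlnor.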

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-17 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #6 from luoxhu at gcc dot gnu.org ---
For V4SI, it is also better to use vector splat and vector rotate operations.

revb:
.LFB0:
.cfi_startproc
vspltish %v1,8
vspltisw %v0,-16
vrlh %v2,%v2,%v1
vrlw %v2,%v2,%v0
blr


Performance improved from 7.322s to 2.445s on a small benchmark because the
load instruction is replaced.

But for V2DI, we don't have "vspltisd" to splat {32,32} into a vector register
before Power9, so is lvx still required?

vector unsigned long long revb_pwr7_l(vector unsigned long long a)
{
 return vec_rl(a, vec_splats((unsigned long long)32));
} 

generates:

revb_pwr7_l:
.LFB1:
.cfi_startproc
.LCF1:
0:  addis 2,12,.TOC.-.LCF1@ha
addi 2,2,.TOC.-.LCF1@l
.localentry revb_pwr7_l,.-revb_pwr7_l
addis %r9,%r2,.LC0@toc@ha
addi %r9,%r9,.LC0@toc@l
lvx %v0,0,%r9
vrld %v2,%v2,%v0
blr
.LC0:
.quad   32
.quad   32
.align 4
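
The rotate-based byte reversal can be checked with a scalar sketch. The helper
names below are hypothetical; each function works on one element, whereas the
vector instructions do the same on every element in parallel. Note that
"vspltisw -16" works as a rotate count of 16 because vrlw only uses the low
five bits of the count.

```c
#include <stdint.h>

static uint16_t
rotl16 (uint16_t v, unsigned n)
{
  n &= 15;
  return (uint16_t) ((v << n) | (v >> ((16 - n) & 15)));
}

static uint32_t
rotl32 (uint32_t v, unsigned n)
{
  n &= 31;
  return (v << n) | (v >> ((32 - n) & 31));
}

static uint64_t
rotl64 (uint64_t v, unsigned n)
{
  n &= 63;
  return (v << n) | (v >> ((64 - n) & 63));
}

/* vrlh by 8 on each halfword, then vrlw by 16: byte-reverses a word.  */
uint32_t
revb_word_by_rotates (uint32_t w)
{
  uint16_t hi = rotl16 ((uint16_t) (w >> 16), 8);
  uint16_t lo = rotl16 ((uint16_t) w, 8);
  uint32_t swapped = ((uint32_t) hi << 16) | lo;
  return rotl32 (swapped, (unsigned) -16);   /* -16 & 31 == 16 */
}

/* One more rotate, by 32 (the vrld above), extends this to a doubleword.  */
uint64_t
revb_dword_by_rotates (uint64_t d)
{
  uint64_t hi = revb_word_by_rotates ((uint32_t) (d >> 32));
  uint64_t lo = revb_word_by_rotates ((uint32_t) d);
  return rotl64 ((hi << 32) | lo, 32);
}
```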

[Bug target/93571] PPC: fmr gets used instead of faster xxlor

2021-06-16 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571

--- Comment #3 from luoxhu at gcc dot gnu.org ---
BTW, I didn't see a performance difference between fmr and xxlor in a small
benchmark.

   Max Ops Per CycleLatency (Min)   Latency (Max)   

fmr -   -   ALU FPR 4   2  
2   1   R   -   -   -   -  
Floating Move Register  


xxlor   -   -   ALU VSR 2   2  
2   1   V   -   1   S   -   -  
VSX Vector Logical OR

[Bug target/93571] PPC: fmr gets used instead of faster xxlor

2021-06-16 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---
It is generated by "*mov_hardfloat64" (i.e. {*movdf_hardfloat64}). Switching
the order of the fmr and xxlor constraints would generate the expected code; is
that correct?

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #5 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #4)
> This PR is specifically about the vec_revb builtin.  But yes, we should
> look at what is generated for all other code (having only the builtin
> generate good code is suboptimal for a generic thing like this), and for
> other sizes as well.

Sorry, I don't quite understand what you mean. IMO vec_revb is expanded by
CODE_FOR_revb_v8hi through the revb_ pattern, so this is where we should
change things to generate better code...
For V8HI, it is natural to use vspltish 8+vrlh to turn
{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} into
{1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14}.

But for V4SI, we need to use vspltish+vrlh to turn it into
{1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14} first, and then a "vrlw 16" to turn it
into {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12}. I am not sure whether this is
better than lvx+xxlnor+vperm, especially for V2DI with an additional "vrld 32"
or "vrld 32"+"vrlq 64"? (Those are all register operations without a load from
memory like lvx.)


bt 5
#0  gen_revb_v8hi (operand0=0x74d4ce40, operand1=0x74d4cf60) at
../../gcc/gcc/config/rs6000/vsx.md:5858
#1  0x10b05360 in insn_gen_fn::operator()
(this=0x130ab188 ) at../../gcc/gcc/recog.h:407
#2  0x11aa1e30 in rs6000_expand_unop_builtin (icode=CODE_FOR_revb_v8hi,
exp=
, target=0x74d4ce40) at ../../gcc/gcc/config/rs6000/rs6000-call.c:9451
#3  0x11ab27a4 in rs6000_expand_builtin (exp=, target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode,
ignore=0) at ../../gcc/gcc/config/rs6000/rs6000-call.c:13157
#4  0x10815268 in expand_builtin (exp=,
target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode, ignore=0) at
../../gcc/gcc/builtins.c:9559

[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Fixed.

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #3 from luoxhu at gcc dot gnu.org ---

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 097a127be07..35b3f1a0e1a 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -1932,7 +1932,7 @@ (define_insn "altivec_vpkuum_direct"
 }
   [(set_attr "type" "vecperm")])

-(define_insn "*altivec_vrl"
+(define_insn "altivec_vrl"
   [(set (match_operand:VI2 0 "register_operand" "=v")
 (rotate:VI2 (match_operand:VI2 1 "register_operand" "v")
(match_operand:VI2 2 "register_operand" "v")))]
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 8c5865b8c34..88b34a2285a 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5849,9 +5849,18 @@ (define_expand "revb_"
   /* Want to have the elements in reverse order relative
 to the endian mode in use, i.e. in LE mode, put elements
 in BE order.  */
-  rtx sel = swap_endian_selector_for_mode(mode);
-  emit_insn (gen_altivec_vperm_ (operands[0], operands[1],
-  operands[1], sel));
+  if (mode == V8HImode)
+   {
+ rtx splt = gen_reg_rtx (V8HImode);
+ emit_insn (gen_altivec_vspltish (splt, GEN_INT (8)));
+ emit_insn (gen_altivec_vrlh (operands[0], operands[1], splt));
+   }
+  else
+   {
+ rtx sel = swap_endian_selector_for_mode ( mode);
+ emit_insn (gen_altivec_vperm_ (operands[0], operands[1],
+  operands[1], sel));
+   }
 }


With the above change, it generates the expected code:

revb:
.LFB0:
.cfi_startproc
vspltisw 0,8
vrlw 2,2,0
blr

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---
But it only works for V8HImode. Is there no better code generation for other
modes like V4SI/V2DI/V1TI that does the byte swap with only two instructions,
like vspltish+vrlh?

  unsigned int swap1[16] = {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};
  unsigned int swap2[16] = {7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8};
  unsigned int swap4[16] = {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12};
  unsigned int swap8[16] = {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14};

For example, V4SI needs to swap shorts first and then swap words, which seems
less straightforward than vperm?
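
The four selector tables above follow one rule, which may make the comparison
between modes easier. A small sketch (byte_reverse_selector is a hypothetical
helper, not a function in GCC) that generates the table for a 16-byte vector
with a given number of elements:

```c
/* Fill SEL with the byte-reverse selector for a 16-byte vector split into
   NELTS equal elements (NELTS must be 1, 2, 4, 8 or 16).  */
void
byte_reverse_selector (unsigned nelts, unsigned sel[16])
{
  unsigned size = 16 / nelts;   /* bytes per element */
  for (unsigned i = 0; i < 16; i++)
    sel[i] = (i / size) * size + (size - 1 - i % size);
}
```

Calling it with 1, 2, 4 and 8 reproduces swap1, swap2, swap4 and swap8
respectively.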

[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316

2021-06-10 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org,
   ||segher at kernel dot crashing.org

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Confirmed. The BE -m32 testing is a nightmare for me... :(

For float128-call.c, we need to check whether the target is BE or LE.
For pr100085.c, vector __int128 is not supported with -m32, so just skip it.
OK for trunk?


[PATCH] rs6000: Fix test case failures by PR100085 [PR101020]

gcc/testsuite/ChangeLog:

PR target/101020
* gcc.target/powerpc/float128-call.c: Adjust.
* gcc.target/powerpc/pr100085.c: Likewise.
---
 gcc/testsuite/gcc.target/powerpc/float128-call.c | 6 --
 gcc/testsuite/gcc.target/powerpc/pr100085.c  | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/powerpc/float128-call.c
b/gcc/testsuite/gcc.target/powerpc/float128-call.c
index a1f09df..b64ffc6 100644
--- a/gcc/testsuite/gcc.target/powerpc/float128-call.c
+++ b/gcc/testsuite/gcc.target/powerpc/float128-call.c
@@ -21,5 +21,7 @@
 TYPE one (void) { return ONE; }
 void store (TYPE a, TYPE *p) { *p = a; }

-/* { dg-final { scan-assembler "lvx 2"  } } */
-/* { dg-final { scan-assembler "stvx 2" } } */
+/* { dg-final { scan-assembler {\mlxvd2x 34\M} {target be} } } */
+/* { dg-final { scan-assembler {\mstxvd2x 34\M} {target be} } } */
+/* { dg-final { scan-assembler {\mlvx 2\M} {target le} } }  */
+/* { dg-final { scan-assembler {\mstvx 2\M} {target le} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr100085.c
b/gcc/testsuite/gcc.target/powerpc/pr100085.c
index 7d8b147..b6738ea 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr100085.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr100085.c
@@ -1,4 +1,4 @@
-/* { dg-do compile } */
+/* { dg-do compile {target lp64} } */
 /* { dg-options "-O2 -mdejagnu-cpu=power8" } */

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-06-08 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #10 from luoxhu at gcc dot gnu.org ---
float128 to vector __int128 is fixed by:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=f700e4b0ee3ef53b48975cf89be26b9177e3a3f3

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-06-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Patch sent; it fixes the __float128 to vector __int128 issue:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571689.html


But for the __float128 to __int128 case mentioned in #c4,
rs6000_modes_tieable_p needs to be hacked to remove the stack operation in
dse1. I am not sure this is *LEGAL*, though, since TImode is allocated to GPRs;
it seems wrong to access TImode from ALTIVEC or VSX registers without copying?

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad11b67b125..ee69463ac46 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1974,6 +1974,9 @@ rs6000_modes_tieable_p (machine_mode mode1, machine_mode
mode2)
   || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
 return mode1 == mode2;

+  if (mode1 == TImode && ALTIVEC_OR_VSX_VECTOR_MODE (mode2))
+return true;
+


xxpermdi %vs0,%vs34,%vs34,3
mfvsrd %r4,%vs34
mfvsrd %r3,%vs0

[Bug target/94613] S/390, powerpc: Wrong code generated for vec_sel builtin

2021-05-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94613

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #14 from luoxhu at gcc dot gnu.org ---
Patch submitted:

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html

[Bug target/97142] __builtin_fmod not optimized on POWER

2021-05-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Patch submitted:

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568143.html

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-05-24 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #3)
> The rotates in 6 and 7 are not merged, and neither are the vec_selects in
> 8 and 9.  Both should be pretty easy to do, there is no unspec in sight,
> etc.

Should this be done in pass bswaps or combine or by peephole2? :)

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #17 from luoxhu at gcc dot gnu.org ---
If the constant limitation is removed, it can be combined successfully with
my new patch for PR94613.

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html

And what do you mean by "This is not canonical form on RTL, and it's not a
useful form either" in c#7, please? I don't understand the point...


Trying 11 -> 16:
   11: r124:V4SI=r127:V4SI:V4SI|~r129:V4SI:V4SI
  REG_DEAD r128:V4SI
  REG_DEAD r129:V4SI
  REG_DEAD r127:V4SI
   16: %v2:V4SI=r124:V4SI
  REG_DEAD r124:V4SI
Successfully matched this instruction:
(set (reg/i:V4SI 66 %v2)
(ior:V4SI (and:V4SI (reg:V4SI 127)
(reg:V4SI 129))
(and:V4SI (not:V4SI (reg:V4SI 129))
(reg:V4SI 128
allowing combination of insns 11 and 16
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i316: %v2:V4SI=r127:V4SI:V4SI|~r129:V4SI:V4SI
  REG_DEAD r127:V4SI
  REG_DEAD r129:V4SI
  REG_DEAD r128:V4SI
deferring rescan insn with uid = 16.


diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 571e2337e27..701f37eb03e 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -3405,7 +3405,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code
code,
 machines, and also has shorter instruction path length.  */
   if (GET_CODE (op0) == AND
  && GET_CODE (XEXP (op0, 0)) == XOR
- && CONST_INT_P (XEXP (op0, 1))
  && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
{
  rtx a = trueop1;
@@ -3419,7 +3418,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code
code,
   /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C)) 
*/
   else if (GET_CODE (op0) == AND
  && GET_CODE (XEXP (op0, 0)) == XOR
- && CONST_INT_P (XEXP (op0, 1))
  && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
{
  rtx a = XEXP (XEXP (op0, 0), 0);

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #16 from luoxhu at gcc dot gnu.org ---

> +2016-11-09  Segher Boessenkool  
> +
> +   * simplify-rtx.c (simplify_binary_operation_1): Simplify
> +   (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and
> +   (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C
> +   is a const_int.


Is it a MUST that C be a const_int here? In the PR90323 case, C is actually not
a constant.

l = l & ~mask;
l |= mask & r;

Trying 8, 9 -> 10:
8: r127:V4SI=r124:V4SI^r131:V4SI
  REG_DEAD r131:V4SI
9: r122:V4SI=r127:V4SI:V4SI
  REG_DEAD r130:V4SI
  REG_DEAD r127:V4SI
   10: r128:V4SI=r124:V4SI^r122:V4SI
  REG_DEAD r124:V4SI
  REG_DEAD r122:V4SI
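
For what it's worth, the two forms are bitwise equivalent for any C, constant
or not: where a bit of C is 1 the result takes the bit from B, otherwise from
A. The sketch below (hypothetical function names, 64-bit scalars standing in
for the vector modes) only demonstrates that equivalence; whether the
simplification is desirable for non-constant C is a separate question.

```c
#include <stdint.h>

/* (xor (and (xor A B) C) A)  */
uint64_t
sel_xor_form (uint64_t a, uint64_t b, uint64_t c)
{
  return ((a ^ b) & c) ^ a;
}

/* (ior (and A ~C) (and B C))  */
uint64_t
sel_ior_form (uint64_t a, uint64_t b, uint64_t c)
{
  return (a & ~c) | (b & c);
}

/* The source-level sequence from PR90323: l = l & ~mask; l |= mask & r;  */
uint64_t
sel_source_form (uint64_t l, uint64_t r, uint64_t mask)
{
  l = l & ~mask;
  l |= mask & r;
  return l;
}
```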

[Bug target/97142] __builtin_fmod not optimized on POWER

2021-04-13 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142

--- Comment #10 from luoxhu at gcc dot gnu.org ---

If not built with fast-math, gimple_has_side_effects returns true, which makes
expand_call_stmt fail to expand "_1 = fmod (x_2(D), y_3(D));" to an
internal function. X86 also produces "bl fmod" for an O3 build.


xlF expands the fmod to the ASM below; no FMA generated?


1900 :
1900:   8c 03 01 10 vspltisw v0,1
1904:   00 00 24 c8 lfd f1,0(r4)
1908:   00 00 03 c8 lfd f0,0(r3)
190c:   e2 03 40 f0 xvcvsxwdp vs2,vs32
1910:   c0 09 62 f0 xsdivdp vs3,vs2,vs1
1914:   80 19 80 f0 xsmuldp vs4,vs0,vs3
1918:   64 21 a0 f0 xsrdpiz vs5,vs4
191c:   88 2d 01 f0 xsnmsubadp vs0,vs1,vs5
1920:   18 00 20 fc frsp f1,f0
1924:   20 00 80 4e blr
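
Read as C, the sequence above computes x - y * trunc (x * (1 / y)). The sketch
below mirrors it step by step; fmod_fast_sketch is a hypothetical name, the
cast stands in for xsrdpiz and assumes the quotient fits in long long, and the
final frsp (round to single) is left out. As far as I can tell the last step,
xsnmsubadp, is itself a fused negative multiply-subtract, so the sequence does
end in an FMA-type instruction. The result is only a fast-math approximation
of fmod, because the reciprocal and the product both round.

```c
/* Fast-math style remainder: x - y * trunc (x * (1 / y)).  Not a
   replacement for fmod; assumes the quotient fits in long long.  */
double
fmod_fast_sketch (double x, double y)
{
  double recip = 1.0 / y;               /* xsdivdp */
  double q = x * recip;                 /* xsmuldp */
  double t = (double) (long long) q;    /* xsrdpiz: round toward zero */
  return x - y * t;                     /* xsnmsubadp does this fused */
}
```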

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-12 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #15 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #14)
> (In reply to luoxhu from comment #12)
> > That code was called by combine pass but fail to match. 
> 
> > 
> > pr newpat
> > (set (reg:DI 125 [ l ])
> > (xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ])
> > (reg:DI 127))
> > (const_int 267390975 [0xff00fff]))
> > (reg/v:DI 120 [ l ])))
> 
> Note this is 0x0ff00fff, and this is not a valid mask for rlwimi.

OK, it also fails to combine for 0x0100.


.cfi_startproc
xor 4,3,4
rlwinm 4,4,0,7,7
xor 3,4,3
blr
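
To read the rlwinm operands: MB and ME select a run of mask bits in IBM
numbering (bit 0 is the most significant bit of the 32-bit word), and the run
wraps around when MB > ME. A sketch of the mask computation follows;
ppc_mask32 is a hypothetical helper that models only the low 32 bits. With it,
"rlwinm 4,4,0,7,7" keeps the single bit 0x01000000, and the
"rlwinm 4,4,0,20,11" / "rldicl 4,4,0,36" pair in comment #11 keeps 0xfff00fff
and then only the low 28 bits, i.e. 0x0ff00fff.

```c
#include <stdint.h>

/* 32-bit mask with ones from bit MB through bit ME in IBM numbering
   (bit 0 is the most significant bit).  MB > ME wraps around.
   Both arguments must be in the range 0..31.  */
uint32_t
ppc_mask32 (unsigned mb, unsigned me)
{
  uint32_t from_mb = 0xffffffffu >> mb;          /* bits mb..31 */
  uint32_t upto_me = 0xffffffffu << (31 - me);   /* bits 0..me */
  return mb <= me ? (from_mb & upto_me) : (from_mb | upto_me);
}
```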

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-09 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #12 from luoxhu at gcc dot gnu.org ---

That code is called by the combine pass but fails to match.

pr newpat
(set (reg:DI 125 [ l ])
(xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ])
(reg:DI 127))
(const_int 267390975 [0xff00fff]))
(reg/v:DI 120 [ l ])))


Trying 8, 10 -> 11:
8: r123:DI=r120:DI^r127:DI
  REG_DEAD r127:DI
   10: r118:DI=r123:DI&0xff00fff
  REG_DEAD r123:DI
   11: r125:DI=r118:DI^r120:DI
  REG_DEAD r120:DI
  REG_DEAD r118:DI
Failed to match this instruction:
(set (reg:DI 125 [ l ])
(ior:DI (and:DI (reg/v:DI 120 [ l ])
(const_int -267390976 [0xf00ff000]))
(and:DI (reg:DI 127)
(const_int 267390975 [0xff00fff]
Successfully matched this instruction:
(set (reg:DI 118 [ _2 ])
(and:DI (reg:DI 127)
(const_int 267390975 [0xff00fff])))
Failed to match this instruction:
(set (reg:DI 125 [ l ])
(ior:DI (and:DI (reg/v:DI 120 [ l ])
(const_int -267390976 [0xf00ff000]))
(reg:DI 118 [ _2 ])))

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-08 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #11 from luoxhu at gcc dot gnu.org ---
I noticed that you added the optimization below with commit
a62436c0a505155fc8becac07a8c0abe2c265bfe. But it doesn't even handle this case:
when the cse1 pass calls simplify_binary_operation_1, both op0 and op1 are REGs
instead of AND operators. Do you have a test case that covers that piece of
code?

__attribute__ ((noinline))
 long without_sel3( long l,  long r) {
long tmp = {0x0ff00fff};
l =  ( (l ^ r) & tmp) ^ l;
return l;
}


without_sel3:
xor 4,3,4
rlwinm 4,4,0,20,11
rldicl 4,4,0,36
xor 3,4,3
blr
.long 0
.byte 0,0,0,0,0,0,0,0


+2016-11-09  Segher Boessenkool  
+
+   * simplify-rtx.c (simplify_binary_operation_1): Simplify
+   (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and
+   (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C
+   is a const_int.

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 5c3dea1a349..11a2e0267c7 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -2886,6 +2886,37 @@ simplify_binary_operation_1 (enum rtx_code code,
machine_mode mode,
}
}

+  /* If we have (xor (and (xor A B) C) A) with C a constant we can instead
+do (ior (and A ~C) (and B C)) which is a machine instruction on some
+machines, and also has shorter instruction path length.  */
+  if (GET_CODE (op0) == AND
+ && GET_CODE (XEXP (op0, 0)) == XOR
+ && CONST_INT_P (XEXP (op0, 1))
+ && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
+   {
+ rtx a = trueop1;
+ rtx b = XEXP (XEXP (op0, 0), 1);
+ rtx c = XEXP (op0, 1);
+ rtx nc = simplify_gen_unary (NOT, mode, c, mode);
+ rtx a_nc = simplify_gen_binary (AND, mode, a, nc);
+ rtx bc = simplify_gen_binary (AND, mode, b, c);
+ return simplify_gen_binary (IOR, mode, a_nc, bc);
+   }
+  /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C))  */
+  else if (GET_CODE (op0) == AND
+ && GET_CODE (XEXP (op0, 0)) == XOR
+ && CONST_INT_P (XEXP (op0, 1))
+ && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
+   {
+ rtx a = XEXP (XEXP (op0, 0), 0);
+ rtx b = trueop1;
+ rtx c = XEXP (op0, 1);
+ rtx nc = simplify_gen_unary (NOT, mode, c, mode);
+ rtx b_nc = simplify_gen_binary (AND, mode, b, nc);
+ rtx ac = simplify_gen_binary (AND, mode, a, c);
+ return simplify_gen_binary (IOR, mode, ac, b_nc);
+   }

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-07 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Then we could optimize it in match.pd:

diff --git a/gcc/match.pd b/gcc/match.pd
index 036f92fa959..8944312c153 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3711,6 +3711,17 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (integer_all_onesp (@1) && integer_zerop (@2))
 @0

+#if GIMPLE
+(simplify
+ (bit_xor @0 (bit_and @2 (bit_xor @0 @1)))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1)
+  && types_match (@0, @2) && VECTOR_TYPE_P (TREE_TYPE (@0))
+  && VECTOR_TYPE_P (TREE_TYPE (@1)) && VECTOR_TYPE_P (TREE_TYPE (@2)))
+ (with { tree itype = truth_type_for (type); }
+ (vec_cond (convert:itype @2) @1 @0))))
+#endif

In pr90323.c.033t.forwprop1, it is optimized to:

   :
  _1 = ~mask_3(D);
  l_5 = _1 & l_4(D);
  _2 = mask_3(D) & r_6(D);
  _8 = l_4(D) ^ r_6(D);
  _10 = mask_3(D) & _8;
  _11 = (vector(4) ) mask_3(D);
  l_7 = VEC_COND_EXPR <_11, r_6(D), l_4(D)>;
  return l_7;

Then in pr90323.c.243t.isel:

   [local count: 1073741824]:
  _6 = (vector(4) ) mask_1(D);
  l_4 = .VCOND_MASK (_6, r_3(D), l_2(D));
  return l_4;

final ASM:

without_sel:
.LFB11:
.cfi_startproc
xxsel 34,34,35,36
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
.LFE11:
.size   without_sel,.-without_sel
.align 2
.p2align 4,,15
.globl with_sel
.type   with_sel, @function
with_sel:
.LFB12:
.cfi_startproc
xxsel 34,34,35,36
blr


@segher, is this a reasonable fix?

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-07 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #8 from luoxhu at gcc dot gnu.org ---
Two minor updates for the case mentioned in #c2:

 For VEC_SEL (ARG1, ARG2, ARG3):

   Returns a vector containing the value of either ARG1 or ARG2 depending on
   the value of ARG3.


#include <stdio.h>
#include <altivec.h>
volatile vector unsigned orig = {0xebebebeb, 0x34343434, 0x76767676,
0x12121212};
volatile vector unsigned mask = {0x, 0, 0x, 0};
volatile vector unsigned fill = {0xfefefefe, 0x, 0x,
0x};
volatile vector unsigned expected = {0xfefefefe, 0x34343434, 0x,
0x12121212};
__attribute__ ((noinline))
vector unsigned without_sel(vector unsigned l, vector unsigned r, vector
unsigned mask) {
-l = l & ~r;
+l = l & ~mask;
l |= mask & r;
return l;
}

__attribute__ ((noinline))
vector unsigned with_sel(vector unsigned l, vector unsigned r, vector unsigned
mask) {
-return vec_sel(l, mask, r);
+return vec_sel(l, r, mask);
}

int main() {
vector unsigned res1 = without_sel(orig, fill, mask);
vector unsigned res2 = with_sel(orig, fill, mask);
if (!vec_all_eq(res1, expected)) printf ("error1\n");
if (!vec_all_eq(res2, expected)) printf ("error2\n");
return 0;
}


And the ASM would be:

without_sel:
xxlxor 35,34,35
xxland 35,35,36
xxlxor 34,34,35
blr
.long 0
.byte 0,0,0,0,0,0,0,0
with_sel:
xxsel 34,34,35,36
blr
.long 0
.byte 0,0,0,0,0,0,0,0

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #21 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #19 from luoxhu at gcc dot gnu.org ---
https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567395.html

This patch extends variable vec_insert to all 32-bit VSX targets, including
Power7{BE}{32,64}, Power8{BE}{32,64}, Power8{LE}{64} and Power9{LE}{64}. All
of them pass the power testcases, though AIX is not tested yet. @Segher, please
review this one instead of the previous patch that disables 32-bit variable
vec_insert, thanks.

Altivec targets like power5/6/G4/G5 still take the previous "vector
store/scalar store/vector load" code path.

-mcpu=power6 -O2 -maltivec -c -S

f2:
.LFB0:
.cfi_startproc
addi 10,1,-16
sldi 5,5,2
li 9,32
addi 8,1,-48
stvx 2,8,9
stwx 6,10,5
lvx 2,8,9
blr

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #15 from luoxhu at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #14)
> You still have:
>   if (VECTOR_MEM_VSX_P (mode))
> {
>   if (!CONST_INT_P (elt_rtx))
> {
>   if ((TARGET_P9_VECTOR && TARGET_POWERPC64) || width == 8)
> return ..._p9 (...);
>   else if (TARGET_P8_VECTOR)
> return ..._p8 (...);
> }
> 
>   if (mode == V2DFmode)
> insn = gen_vsx_set_v2df (target, target, val, elt_rtx);
> 
>   else if (mode == V2DImode)
> insn = gen_vsx_set_v2di (target, target, val, elt_rtx);
> 
>   else if (TARGET_P9_VECTOR && TARGET_POWERPC64)
> {
>   ...
> }
>   if (insn)
> return;
> }
> 
>   gcc_assert (CONST_INT_P (elt_rtx));
> 
> while the vector.md condition is VECTOR_MEM_ALTIVEC_OR_VSX_P (mode),
> i.e. true for TARGET_ALTIVEC for many modes already (V4SI, V8HI, V16QI, V4SF
> and
> for TARGET_VSX also V2DF and V2DI, right).
> I somehow don't see how this can work properly.
> Looking at vsx_set_v2df and vsx_set_v2di, neither of them will handle
> non-constant elt_rtx (it ICEs on anything but const0_rtx and const1_rtx).
> 
> So, questions:
> 1) does the rs6000_expand_vector_set_var_p9 routine for width == 8 (i.e.
> V2DImode or V2DFmode?)
> handle everything, even when TARGET_P9_VECTOR or TARGET_POWERPC64 is not
> true, plain old VSX?

Yes. V2DI/V2DF for P8 {BE,LE} {m32,m64} will call
rs6000_expand_vector_set_var_p9 instead of xxx_p8.

Do you mean Power7 by "plain old VSX"? I verified pr98914.c on Power7: it
ICEs exactly on "gcc_assert (CONST_INT_P (elt_rtx));" for both m64 and m32.
This is not yet fixed by the patch in #c11.

The builtin call in rs6000-c.c:altivec_build_resolved_builtin is guarded by
TARGET_P8_VECTOR, so Power7 didn't generate IFN VEC_INSERT before. This ICE
comes from the internal optimization gimple-isel.c:gimple_expand_vec_set_expr:
can_vec_set_var_idx_p doesn't return false because VECTOR_MEM_ALTIVEC_OR_VSX_P
is true for Power7 VSX. Changing "if (VECTOR_MEM_VSX_P (mode))" to "if
(VECTOR_MEM_ALTIVEC_OR_VSX_P (mode))" in rs6000.c:rs6000_expand_vector_set and
removing TARGET_P8_VECTOR in the else branch could fix the ICE on P7 {m32,m64}.
This means even P7 VSX could benefit from this optimization, which is different
from what was discussed before.


> 2) what happens if TARGET_P8_VECTOR is false and TARGET_VSX is true and mode
> is other than V2DI/V2DF? If I read the code right, it will fall through to
> gcc_assert (CONST_INT_P (elt_rtx));

Same as 1)?

> 3) what happens if !TARGET_VSX (more specifically, when VECTOR_MEM_VSX_P
> (mode) is false.
> I see there just the assertion that would fail right away.
> Perhaps I'm missing something obvious and those cases are impossible, but if
> that is the case, it would still be better to add further assertion at least
> to the if (...) else if (...) as else gcc_assert ...

Thanks for pointing this out. The "gcc_assert (CONST_INT_P (elt_rtx));" should
be moved into the "if (!CONST_INT_P (elt_rtx))" condition like you said.
gen_vsx_set_v2df and gen_vsx_set_v2di are supposed to handle only const
elt_rtx.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #13 from luoxhu at gcc dot gnu.org ---
The performance data in #c11 is for the int variable vec_insert in 32-bit mode.
The float variable vec_insert in 32-bit mode is a bit slower, but still much
better than the original (the extra stfs+lwz comes from insn 17 and insn 18 in
expand, which move the SF register to an SI register by hex value):

46.677s -> 8.723s

test.c

#include <altivec.h>
#define TYPE float

vector TYPE
test (vector TYPE u, TYPE i, signed int n){
return vec_insert (i, u, n);
}

Expand:
1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r122:V4SF=%2:V4SF
3: r123:SF=%1:SF
4: r124:SI=%3:SI
5: NOTE_INSN_FUNCTION_BEG
8: r120:V4SF=r122:V4SF
9: r125:SI=r124:SI&0x3
   10: r126:V4SF=r120:V4SF
   11: r128:SI=r125:SI<<0x2
   12: {r128:SI=0x14-r128:SI;clobber ca:SI;}
   13: r132:SI=high(`*.LC0')
   14: r131:SI=r132:SI+low(`*.LC0')
  REG_EQUAL `*.LC0'
   15: r130:V2DI=[r131:SI]
  REG_EQUAL const_vector
   16: r129:V16QI=r130:V2DI#0
   17: [r112:SI]=r123:SF
   18: r133:SI=[r112:SI]
   19: r136:DI#4=r133:SI
   22: {r137:SI=r133:SI>>0x1f;clobber ca:SI;}
   23: r136:DI#0=r137:SI
   24: r138:DI=0
   25: r135:V2DI=vec_concat(r136:DI,r138:DI)
   26: r134:V16QI=r135:V2DI#0
   27: r139:V16QI=unspec[r128:SI] 151
   28: r140:V16QI=unspec[r134:V16QI,r134:V16QI,r139:V16QI] 236
   29: r141:V16QI=unspec[r129:V16QI,r129:V16QI,r139:V16QI] 236
   30: r126:V4SF#0={(r141:V16QI!=const_vector)?r140:V16QI:r126:V4SF#0}
   31: r119:V4SF=r126:V4SF
   32: r120:V4SF=r119:V4SF

ASM:

.LFB0:
.cfi_startproc
stwu 1,-16(1)
.cfi_def_cfa_offset 16
lis 9,.LC0@ha
rlwinm 3,3,2,28,29
xxlxor 0,0,0
la 9,.LC0@l(9)
subfic 3,3,20
lxvd2x 33,0,9
lvsl 13,0,3
stfs 1,8(1)
vperm 1,1,1,13
ori 2,2,0
lwz 9,8(1)
addi 1,1,16
.cfi_def_cfa_offset 0
srawi 10,9,31
mtvsrwz 13,9
mtvsrwz 12,10
fmrgow 11,12,13
xxpermdi 32,11,0,0
vperm 0,0,0,13
xxsel 34,34,32,33
blr
