https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 13 Aug 2025, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
> 
> --- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> In gimple that's
> 
>   <bb 10> [local count: 108459]:
>   x_22 = a[0];
>   _69 = {x_22, x_22, x_22, x_22};
> 
>   <bb 4> [local count: 10737416]:
>   # ivtmp_83 = PHI <ivtmp_84(11), 0(10)>
> 
>   <bb 5> [local count: 1063004408]:
>   # i_43 = PHI <i_24(12), 0(4)>
>   # ivtmp_34 = PHI <ivtmp_33(12), 32000(4)>
>   # vect_x_36.8_70 = PHI <vect_x_9.10_72(12), _69(4)>
>   # vect_vec_iv_.12_76 = PHI <_77(12), { 0, 0, 0, 0 }(4)>
>   # vect_index_39.13_78 = PHI <vect_index_12.14_79(12), { 0, 0, 0, 0 }(4)>
>   _4 = a[i_43];
>   vect_cst__68 = {_4, _4, _4, _4};
>   mask__16.9_71 = vect_cst__68 > vect_x_36.8_70;
>   vect_index_12.14_79 = VEC_COND_EXPR <mask__16.9_71, vect_vec_iv_.12_76, vect_index_39.13_78>;
>   vect_x_9.10_72 = VEC_COND_EXPR <mask__16.9_71, vect_cst__68, vect_x_36.8_70>;
>   i_24 = i_43 + 1;
>   ivtmp_33 = ivtmp_34 - 1;
>   _77 = vect_vec_iv_.12_76 + { 1, 1, 1, 1 };
>   if (ivtmp_33 != 0)
>     goto <bb 12>; [98.99%]
>   else
>     goto <bb 8>; [1.01%]
> 
> The SLP tree seems to mostly be working on lanes of externals:
> 
> note:   Vectorizing SLP tree:
> note:   node 0x42900120 (max_nunits=4, refcnt=1) vector(4) float
> note:   op template: x_41 = PHI <x_9(5)>
> note:           [l] stmt 0 x_41 = PHI <x_9(5)>
> note:           children 0x429001c8
> note:   node 0x429001c8 (max_nunits=4, refcnt=2) vector(4) float
> note:   op template: x_9 = _16 ? _4 : x_36;
> note:           stmt 0 x_9 = _16 ? _4 : x_36;
> note:           children 0x42900270 0x42900318 0x429003c0
> note:   node 0x42900270 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
> note:   op template: _16 = _4 > x_36;
> note:           stmt 0 _16 = _4 > x_36;
> note:           children 0x42900318 0x429003c0
> note:   node 0x42900318 (max_nunits=4, refcnt=2) vector(4) float
> note:   op template: _4 = a[i_43];
> note:           stmt 0 _4 = a[i_43];
> note:   node 0x429003c0 (max_nunits=4, refcnt=2) vector(4) float
> note:   op template: x_36 = PHI <x_9(12), x_22(4)>
> note:           stmt 0 x_36 = PHI <x_9(12), x_22(4)>
> note:           children 0x429001c8 0x42900468
> note:   node (external) 0x42900468 (max_nunits=1, refcnt=1) vector(4) float
> note:           { x_22 }
> 
> It also looks like we missed simplifying a > b ? a : b into just a max.
> 
> Previously, we failed during analysis in the block that has since been removed:
> 
> missed:   Unsupported loop-closed phi in outer-loop.
> missed:  bad operation or unsupported loop bound
> 
> and now it's a costing issue, as it's an inner loop.
> 
> You can reduce it down to:
> 
> #define iterations 100000
> #define LEN_1D 32000
> 
> float a[LEN_1D];
> 
> int main()
> {
>     float x;
>     for (int nl = 0; nl < iterations; nl++) {
>         x = a[0];
>         for (int i = 0; i < LEN_1D; ++i) {
>             if (a[i] > x) {
>                 x = a[i];
>             }
>         }
>     }
> 
>     return x > 1;
> } 
> 
> It looks like the access of a[0] in the outer loop is making it treat the
> inner loop as only being able to access one element at a time.

Outer loop vectorization basically keeps the inner loop "scalar" but
executes N outer loop iterations at the same time.  Since the outer
loop iterations here are completely redundant (each one recomputes the
same x), this is expected.  So I'd say the missed optimization is that
we do not elide the outer loop completely.  Note that the benchmark
should now execute only iterations/4 outer loop iterations, so if the
vectorized inner loop is now slower than the scalar inner loop, then
it's a costing issue.
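
As a rough illustration only (this is not what the vectorizer literally
emits), the effect can be sketched in plain C.  The sketch assumes a
vectorization factor of 4 and an iteration count that is a positive
multiple of 4; the names VF, max_scalar and max_outer_vectorized are
made up for this example.  Each "lane" stands for one outer loop
iteration, and the inner loop still walks a[] one element at a time,
broadcasting it to all lanes.

```c
#include <assert.h>

#define LEN_1D 32000
#define VF 4                    /* assumed vectorization factor */

static float a[LEN_1D];

/* One outer iteration of the reduced test case, as written. */
static float max_scalar(void)
{
    float x = a[0];
    for (int i = 0; i < LEN_1D; ++i)
        if (a[i] > x)
            x = a[i];
    return x;
}

/* Emulation of outer-loop vectorization: VF outer iterations run
   together, one per lane.  The inner loop is not sped up; it still
   loads a single a[i] per step and compares it against every lane. */
static float max_outer_vectorized(int iterations)
{
    /* Assumption of this sketch: no epilogue loop is needed. */
    assert(iterations > 0 && iterations % VF == 0);

    float x[VF];
    for (int nl = 0; nl < iterations; nl += VF) {
        for (int l = 0; l < VF; ++l)
            x[l] = a[0];                /* like _69 = {x_22, x_22, ...} */
        for (int i = 0; i < LEN_1D; ++i) {
            float v = a[i];             /* like vect_cst__68 = {_4, ...} */
            for (int l = 0; l < VF; ++l)
                if (v > x[l])
                    x[l] = v;           /* like the VEC_COND_EXPR */
        }
    }
    return x[0];                        /* all lanes hold the same value */
}
```

Since every lane computes the same value, max_outer_vectorized(n)
returns the same result as a single call to max_scalar(), which is
what eliding the outer loop would amount to.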

Richard.
