Hi,

Currently, when a loop is vectorized we adjust its nb_iterations_upper_bound by dividing it by VF. This is incorrect: nb_iterations_upper_bound is an upper bound for (<number of loop iterations> - 1), so simply dividing it by VF in many cases gives a bound greater than the real one. The correct value would be ((nb_iterations_upper_bound + 1) / VF - 1).
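To illustrate the off-by-one, here is a small standalone C sketch. It is not GCC code: the helper names old_vector_bound and new_vector_bound are made up for this illustration, and plain 64-bit integers stand in for widest_int. It assumes ub is an upper bound for (<number of loop iterations> - 1), as described above. The clamp to zero is an assumption of the sketch, added only to keep the unsigned arithmetic from wrapping when fewer than VF scalar iterations are possible.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only (not part of GCC).
   `ub` bounds (number of scalar iterations - 1); `vf` is the
   vectorization factor and must be nonzero.  */

/* Old computation: divide the latch-count bound directly by VF.  */
static uint64_t
old_vector_bound (uint64_t ub, uint64_t vf)
{
  return ub / vf;
}

/* Proposed computation: convert the bound to an iteration count,
   divide by VF, then convert back to a latch-count bound.  */
static uint64_t
new_vector_bound (uint64_t ub, uint64_t vf)
{
  uint64_t vec_iters = (ub + 1) / vf;

  /* Assumption of this sketch: clamp instead of wrapping below zero.  */
  return vec_iters == 0 ? 0 : vec_iters - 1;
}
```

For a scalar loop with at most 127 iterations (ub = 126) and VF = 64, old_vector_bound returns 1, which still allows two vector iterations, while new_vector_bound returns 0, which allows exactly one.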
Also, the decrement due to peeling for gaps should happen before we scale the bound by VF, because peeling applies to the scalar loop, not the vectorized one.

This patch modifies the nb_iterations_upper_bound computation to resolve these issues.

Running regression testing I got one failure, caused by a loop that is now optimized away. Here is the loop:

foo (signed char s)
{
  signed char i;
  for (i = 0; i < s; i++)
    yy[i] = (signed int) i;
}

Here we vectorize for AVX512 using VF=64. The original loop has at most 127 iterations, and therefore the vectorized loop can execute only once. With the patch applied the compiler detects this and transforms the loop into a BB with just stores of constant vectors into yy. The test was adjusted to increase the number of possible iterations. A copy of the test was added to check that we can optimize out the original loop.

Bootstrapped and regtested on x86_64-pc-linux-gnu. OK for trunk?

Thanks,
Ilya
--
gcc/

2016-04-21  Ilya Enkovich  <ilya.enkov...@intel.com>

	* tree-vect-loop.c (vect_transform_loop): Fix
	nb_iterations_upper_bound computation for vectorized loop.

gcc/testsuite/

2016-04-21  Ilya Enkovich  <ilya.enkov...@intel.com>

	* gcc.target/i386/vect-unpack-2.c (avx512bw_test): Avoid
	optimization of vector loop.
	* gcc.target/i386/vect-unpack-3.c: New test.
diff --git a/gcc/testsuite/gcc.target/i386/vect-unpack-2.c b/gcc/testsuite/gcc.target/i386/vect-unpack-2.c
index 4825248..51c518e 100644
--- a/gcc/testsuite/gcc.target/i386/vect-unpack-2.c
+++ b/gcc/testsuite/gcc.target/i386/vect-unpack-2.c
@@ -6,19 +6,22 @@
 
 #define N 120
 signed int yy[10000];
+signed char zz[10000];
 
 void
-__attribute__ ((noinline)) foo (signed char s)
+__attribute__ ((noinline,noclone)) foo (int s)
 {
-  signed char i;
+  int i;
   for (i = 0; i < s; i++)
-    yy[i] = (signed int) i;
+    yy[i] = zz[i];
 }
 
 void
 avx512bw_test ()
 {
   signed char i;
+  for (i = 0; i < N; i++)
+    zz[i] = i;
   foo (N);
   for (i = 0; i < N; i++)
     if ( (signed int)i != yy [i] )
diff --git a/gcc/testsuite/gcc.target/i386/vect-unpack-3.c b/gcc/testsuite/gcc.target/i386/vect-unpack-3.c
new file mode 100644
index 0000000..eb8a93e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-unpack-3.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-vect-details -ftree-vectorize -ffast-math -mavx512bw -save-temps" } */
+/* { dg-require-effective-target avx512bw } */
+
+#include "avx512bw-check.h"
+
+#define N 120
+signed int yy[10000];
+
+void
+__attribute__ ((noinline)) foo (signed char s)
+{
+  signed char i;
+  for (i = 0; i < s; i++)
+    yy[i] = (signed int) i;
+}
+
+void
+avx512bw_test ()
+{
+  signed char i;
+  foo (N);
+  for (i = 0; i < N; i++)
+    if ( (signed int)i != yy [i] )
+      abort ();
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-assembler-not "vpmovsxbw\[ \\t\]+\[^\n\]*%zmm" } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index d813b86..da98211 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6921,11 +6921,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
                       expected_iterations / vectorization_factor);
-  loop->nb_iterations_upper_bound
-    = wi::udiv_floor (loop->nb_iterations_upper_bound, vectorization_factor);
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
       && loop->nb_iterations_upper_bound != 0)
     loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1;
+  loop->nb_iterations_upper_bound
+    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
+                      vectorization_factor) - 1;
+
   if (loop->any_estimate)
     {
       loop->nb_iterations_estimate