On 11/28/2017 08:15 AM, Richard Biener wrote:
>
> The following adds a new target hook, targetm.vectorize.split_reduction,
> which allows the target to specify a preferred mode to perform the
> final reduction on using either vector shifts or scalar extractions.
> Up to that mode the vector reduction result is reduced by combining
> lowparts and highparts recursively.  This avoids lane-crossing operations
> when doing AVX256 on Zen and Bulldozer and also speeds up things on
> Haswell (I verified ~20% speedup on Broadwell).
>
> Thus the patch implements the target hook on x86 to _always_ prefer
> SSE modes for the final reduction.
>
> For the testcase in the bugzilla
>
> int sumint(const int arr[]) {
>     arr = __builtin_assume_aligned(arr, 64);
>     int sum=0;
>     for (int i=0 ; i<1024 ; i++)
>       sum+=arr[i];
>     return sum;
> }
>
> this changes -O3 -mavx512f code from
>
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxord  %zmm0, %zmm0, %zmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %zmm0, %zmm0
>         addq    $64, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vpxord  %zmm1, %zmm1, %zmm1
>         vshufi32x4      $78, %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovdqa64       .LC0(%rip), %zmm2
>         vpermi2d        %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovdqa64       .LC1(%rip), %zmm2
>         vpermi2d        %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovdqa64       .LC2(%rip), %zmm2
>         vpermi2d        %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovd   %xmm0, %eax
>
> to
>
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxord  %zmm0, %zmm0, %zmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %zmm0, %zmm0
>         addq    $64, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vextracti64x4   $0x1, %zmm0, %ymm1
>         vpaddd  %ymm0, %ymm1, %ymm1
>         vmovdqa %xmm1, %xmm0
>         vextracti128    $1, %ymm1, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vpsrldq $8, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vpsrldq $4, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vmovd   %xmm0, %eax
>
> and for -O3 -mavx2 from
>
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxor   %xmm0, %xmm0, %xmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %ymm0, %ymm0
>         addq    $32, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vpxor   %xmm1, %xmm1, %xmm1
>         vperm2i128      $33, %ymm1, %ymm0, %ymm2
>         vpaddd  %ymm2, %ymm0, %ymm0
>         vperm2i128      $33, %ymm1, %ymm0, %ymm2
>         vpalignr        $8, %ymm0, %ymm2, %ymm2
>         vpaddd  %ymm2, %ymm0, %ymm0
>         vperm2i128      $33, %ymm1, %ymm0, %ymm1
>         vpalignr        $4, %ymm0, %ymm1, %ymm1
>         vpaddd  %ymm1, %ymm0, %ymm0
>         vmovd   %xmm0, %eax
>
> to
>
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxor   %xmm0, %xmm0, %xmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %ymm0, %ymm0
>         addq    $32, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vmovdqa %xmm0, %xmm1
>         vextracti128    $1, %ymm0, %xmm0
>         vpaddd  %xmm0, %xmm1, %xmm0
>         vpsrldq $8, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vpsrldq $4, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vmovd   %xmm0, %eax
>         vzeroupper
>         ret
>
> which besides being faster is also smaller (fewer prefixes).
>
> SPEC 2k6 results on Haswell (thus AVX2) are neutral.  As it merely
> affects reduction vectorization epilogues I didn't expect big effects
> but for loops that do not run much (more likely with AVX512).
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Ok for trunk?
>
> The PR mentions some more tricks to optimize the sequence but
> those look like backend only optimizations.
>
> Thanks,
> Richard.
>
> 2017-11-28  Richard Biener  <rguent...@suse.de>
>
>         PR tree-optimization/80846
>         * target.def (split_reduction): New target hook.
>         * targhooks.c (default_split_reduction): New function.
>         * targhooks.h (default_split_reduction): Declare.
>         * tree-vect-loop.c (vect_create_epilog_for_reduction): If the
>         target requests first reduce vectors by combining low and high
>         parts.
>         * tree-vect-stmts.c (vect_gen_perm_mask_any): Adjust.
>         (get_vectype_for_scalar_type_and_size): Export.
>         * tree-vectorizer.h (get_vectype_for_scalar_type_and_size): Declare.
>
>         * doc/tm.texi.in (TARGET_VECTORIZE_SPLIT_REDUCTION): Document.
>         * doc/tm.texi: Regenerate.
>
>         i386/
>         * config/i386/i386.c (ix86_split_reduction): Implement
>         TARGET_VECTORIZE_SPLIT_REDUCTION.
>
>         * gcc.target/i386/pr80846-1.c: New testcase.
>         * gcc.target/i386/pr80846-2.c: Likewise.

I'm not a big fan of introducing these kinds of target queries into the
gimple optimizers, but I think we've all agreed to allow them to varying
degrees within the vectorizer.
So no objections from me.  You know the vectorizer bits far better than I :-)

jeff
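
For reference, the new -mavx2 epilogue above amounts to folding the high
128-bit half of the accumulator onto the low half and then reducing within
a single XMM register.  A minimal intrinsics sketch of that pattern (the
helper name is invented for illustration and is not part of the patch):

#include <immintrin.h>

/* Illustrative only: sum the eight 32-bit lanes of an AVX2 accumulator
   by combining low and high halves, mirroring the vextracti128/vpsrldq
   sequence in the new epilogue.  */
static inline int
hsum_v8si (__m256i v)
{
  __m128i lo = _mm256_castsi256_si128 (v);        /* lanes 0..3 */
  __m128i hi = _mm256_extracti128_si256 (v, 1);   /* lanes 4..7 */
  __m128i s = _mm_add_epi32 (lo, hi);             /* 8 -> 4 lanes */
  s = _mm_add_epi32 (s, _mm_srli_si128 (s, 8));   /* 4 -> 2 lanes */
  s = _mm_add_epi32 (s, _mm_srli_si128 (s, 4));   /* 2 -> 1 lane */
  return _mm_cvtsi128_si32 (s);                   /* extract lane 0 */
}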