Hi, gather instructions are rather hard to implement in hardware and except for skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the degree I did not find real world loop where gather would help on Zen. This patch simply adds a knob to disable its autogeneration (builtin still works). I have considered two alternatives 1) tune this with x86-tune-costs because gather is still profitable than scalar code if we do very expensive operations (such as sequence of divides) on the values gathered/scattered 2) implement expansion of gathers into primitive instructions. This is faster as semantics of gather is bit weird and not fully needed for our vectorizer
I did not have luck to get any good results out of 1 alone as cost model is not very realistic. 1+2 probably makes sense but we can do this incrementally as that would make most sense to be implemented generically in vectorizer on the top of this change. Given that gather is problematic even on skylake+ as shown in the PR (which has most optimized implementation of it) it is good to have a knob to control its codegen at first place. I have also disabled gathers for generic. This is because its use causes some two-digit regression on zen for spec2k17 while there are no measurable benefits on Intel. Note that this affects only -march=<somethig supporting gather> -mtune=generic as by default we do not use AVX2. Bootstrapped/regtested x86_64-linux, plan to commit it later today if there are no complains. Honza PR target/81616 * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER. * i386.h (TARGET_USE_GATHER): Define. * x86-tune.def (X86_TUNE_USE_GATHER): New. Index: config/i386/i386.c =================================================================== --- config/i386/i386.c (revision 256369) +++ config/i386/i386.c (working copy) @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre bool si; enum ix86_builtins code; - if (! TARGET_AVX2) + if (! TARGET_AVX2 || !TARGET_USE_GATHER) return NULL_TREE; if ((TREE_CODE (index_type) != INTEGER_TYPE Index: config/i386/i386.h =================================================================== --- config/i386/i386.h (revision 256369) +++ config/i386/i386.h (working copy) @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[ ix86_tune_features[X86_TUNE_SLOW_PSHUFB] #define TARGET_AVOID_4BYTE_PREFIXES \ ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES] +#define TARGET_USE_GATHER \ + ix86_tune_features[X86_TUNE_USE_GATHER] #define TARGET_FUSE_CMP_AND_BRANCH_32 \ ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32] #define TARGET_FUSE_CMP_AND_BRANCH_64 \ Index: config/i386/x86-tune.def =================================================================== --- config/i386/x86-tune.def (revision 256369) +++ config/i386/x86-tune.def (working copy) @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", m_SILVERMONT | m_INTEL) +/* X86_TUNE_USE_GATHER: Use gather instructions. */ +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather", + ~(m_ZNVER1 | m_GENERIC)) + /*****************************************************************************/ /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ /*****************************************************************************/