Hi,
gather instructions are rather hard to implement in hardware and, on chips
other than skylake+ (i.e. haswell and Zen), they seem to be rather slow; to the
degree that I did not find a real-world loop where gather would help on Zen.
This patch simply adds a knob to disable their autogeneration (the builtin
still works).  I have considered two alternatives:
 1) tune this with x86-tune-costs, because gather is still more profitable than
    scalar code if we do very expensive operations (such as a sequence of
    divides) on the values gathered/scattered
 2) implement expansion of gathers into primitive instructions.  This is faster
    as the semantics of gather are a bit weird and not fully needed by our
    vectorizer

I did not have any luck getting good results out of 1 alone, as the cost model
is not very realistic.  1+2 probably makes sense, but we can do this
incrementally, as it would make most sense to implement that generically in
the vectorizer on top of this change.
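
As a rough illustration of what alternative 2 means (a sketch only, not what
this patch or the vectorizer does): a 4-lane gather can be open-coded as
independent scalar loads.  The name gather4_open_coded is made up for this
example, and the mask operand of the hardware instruction is ignored.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GCC: emulate a 4-lane gather of doubles
   with plain scalar loads.  The mask operand of the real instruction is
   ignored, on the assumption that all lanes are wanted.  */
void
gather4_open_coded (const double *base, const int idx[4], double out[4])
{
  for (size_t lane = 0; lane < 4; lane++)
    out[lane] = base[idx[lane]];
}
```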

Given that gather is problematic even on skylake+, as shown in the PR (which
has the most optimized implementation of it), it is good to have a knob to
control its codegen in the first place.

I have also disabled gathers for generic.  This is because their use causes
some two-digit regressions on zen for spec2k17 while there are no measurable
benefits on Intel.  Note that this affects only
-march=<something supporting gather> -mtune=generic
as by default we do not use AVX2.
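
For reference, the kind of loop affected is an indexed load such as the one
below (a made-up example, not taken from the PR or from spec2k17).  With
something like -O3 -march=haswell the vectorizer may use a gather for
a[idx[i]]; with this patch and -mtune=generic or -mtune=znver1 it should no
longer do so.  Assuming the new flag is reachable through -mtune-ctrl like
the other x86-tune.def entries, -mtune-ctrl=use_gather or
-mtune-ctrl=^use_gather should override the default.

```c
/* Made-up example of a loop with an indexed (gather-style) load.
   The restrict qualifiers tell the compiler that the arrays do not alias.  */
void
scale_indexed (double *restrict out, const double *restrict a,
               const int *restrict idx, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = 2.0 * a[idx[i]];
}
```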

Bootstrapped/regtested x86_64-linux; I plan to commit it later today if there
are no complaints.

Honza

        PR target/81616
        * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
        * i386.h (TARGET_USE_GATHER): Define.
        * x86-tune.def (X86_TUNE_USE_GATHER): New.
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 256369)
+++ config/i386/i386.c  (working copy)
@@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre
   bool si;
   enum ix86_builtins code;
 
-  if (! TARGET_AVX2)
+  if (! TARGET_AVX2 || !TARGET_USE_GATHER)
     return NULL_TREE;
 
   if ((TREE_CODE (index_type) != INTEGER_TYPE
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h  (revision 256369)
+++ config/i386/i386.h  (working copy)
@@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[
        ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
 #define TARGET_AVOID_4BYTE_PREFIXES \
        ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
+#define TARGET_USE_GATHER \
+       ix86_tune_features[X86_TUNE_USE_GATHER]
 #define TARGET_FUSE_CMP_AND_BRANCH_32 \
        ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
 #define TARGET_FUSE_CMP_AND_BRANCH_64 \
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 256369)
+++ config/i386/x86-tune.def    (working copy)
@@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
           m_SILVERMONT | m_INTEL)
 
+/* X86_TUNE_USE_GATHER: Use gather instructions.  */
+DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
+          ~(m_ZNVER1 | m_GENERIC))
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/
