On Tue, Sep 16, 2025 at 9:53 AM Liu, Hongtao <hongtao....@intel.com> wrote: > > > > > -----Original Message----- > > From: Richard Biener <richard.guent...@gmail.com> > > Sent: Tuesday, September 16, 2025 3:03 PM > > To: Liu, Hongtao <hongtao....@intel.com> > > Cc: gcc-patches@gcc.gnu.org; hjl.to...@gmail.com > > Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move,store}_by > > pieces tune. > > > > On Tue, Sep 16, 2025 at 7:53 AM liuhongt <hongtao....@intel.com> wrote: > > > > > > From: "hongtao.liu" <hongtao....@intel.com> > > > > > > Align move_max with prefer_vector_width for SPR/GNR/DMR to avoid STLF > > issue. > > > It's similar as previous commit. > > > > > > commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123 > > > Author: liuhongt <hongtao....@intel.com> > > > Date: Thu Aug 15 12:54:07 2024 +0800 > > > > > > Align ix86_{move_max,store_max} with vectorizer. > > > > > > When none of mprefer-vector-width, avx256_optimal/avx128_optimal, > > > avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC will > > > set ix86_{move_max,store_max} as max available vector length except > > > for AVX part. > > > > > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) > > > && TARGET_EVEX512_P (opts->x_ix86_isa_flags2)) > > > opts->x_ix86_move_max = PVW_AVX512; > > > else > > > opts->x_ix86_move_max = PVW_AVX128; > > > > > > So for -mavx2, vectorizer will choose 256-bit for vectorization, but > > > 128-bit is used for struct copy, there could be a potential STLF issue > > > due to this "misalign". > > > > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > > Ready push to trunk. > > > > Not an objection to this patch, but I wonder if we have too many tunables > > where these kind of interactions cause more problems than positives by > > having them? > My experience is we should align MOVE_MAX with prefer_vector_width, > vectorization and memcpy are 2 main parts which create vector load/store, > they'd better align with each other. > > > > As general comment I'd say store-by-pieces with larger modes is always good, > > but the read part for move-by-pieces can cause STLF fails, not only from > Yes, avx512_store_by_piece is still good, I'll adjust my patch. > > > vectorization but of course also from (non-contigous) scalar stores > > performed > > by user code. > > > > I'm always advising CPU implementors to reduce STLF fail penalties (or allow > > more forwarding cases). On the compilers side we'd have to see to do less > > "local" (per stmt) decisions on what modes to use but take into account > > context - the post-fact fixup by the avoid-store-forwarding pass is likely > > not > > very effective? > I think the pass can't handle inter-procedure pointer analysis, and there're > memcpy and vectorization in caller/callee with different size, worst case is > small store in the caller followed by big load in the callee.
True, IIRC this happened in 538.imagick_r at some point. There's also __attribute__((target)), so caller/callee could disagree about vector ISA (and/or size preference). But I guess that's not happening too often to be of concern. Richard. > > > > > gcc/ChangeLog: > > > > > > * config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): > > > (X86_TUNE_AVX512_STORE_BY_PIECES): > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/i386/pieces-memcpy-18.c: Use -mtune=znver5 > > > instead of -mtune=sapphirerapids. > > > * gcc.target/i386/pieces-memcpy-21.c: Ditto. > > > * gcc.target/i386/pieces-memset-46.c: Ditto. > > > * gcc.target/i386/pieces-memset-49.c: Ditto. > > > --- > > > gcc/config/i386/x86-tune.def | 6 ++---- > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c | 2 +- > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c | 2 +- > > > gcc/testsuite/gcc.target/i386/pieces-memset-46.c | 2 +- > > > gcc/testsuite/gcc.target/i386/pieces-memset-49.c | 2 +- > > > 5 files changed, 6 insertions(+), 8 deletions(-) > > > > > > diff --git a/gcc/config/i386/x86-tune.def > > > b/gcc/config/i386/x86-tune.def index a86cbad281c..8e642aa92a2 100644 > > > --- a/gcc/config/i386/x86-tune.def > > > +++ b/gcc/config/i386/x86-tune.def > > > @@ -625,14 +625,12 @@ DEF_TUNE > > (X86_TUNE_AVX256_STORE_BY_PIECES, > > > "avx256_store_by_pieces", > > > /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with > > 512-bit > > > AVX instructions. */ > > > DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, > > "avx512_move_by_pieces", > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5) > > > + m_ZNVER4 | m_ZNVER5) > > > > > > /* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with > > 512-bit > > > AVX instructions. */ > > > DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, > > "avx512_store_by_pieces", > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5) > > > + m_ZNVER4 | m_ZNVER5) > > > > > > /* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for > > 512-bit > > > vectorized loops. */ > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > index b15a0db9ff0..b4995ac0598 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -march=sapphirerapids" } */ > > > +/* { dg-options "-O2 -march=znver5" } */ > > > > > > extern char *dst, *src; > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > index ef439f20f74..804a2989d64 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */ > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */ > > > > > > extern char *dst, *src; > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > index be1b054eed2..43d636ee3ff 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -march=sapphirerapids" } */ > > > +/* { dg-options "-O2 -march=znver5" } */ > > > > > > extern char *dst; > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > index ad43f89a9bd..ca4933ac1d8 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */ > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */ > > > > > > extern char *dst; > > > > > > -- > > > 2.34.1 > > >