[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2023-07-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
   Target Milestone|--- |12.0
 Resolution|--- |FIXED

--- Comment #12 from Richard Biener  ---
Fixed.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-15 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #11 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:243e0a5b1942879bc005bf150a744e69a4fcdc87

commit r12-3542-g243e0a5b1942879bc005bf150a744e69a4fcdc87
Author: liuhongt 
Date:   Mon Sep 13 10:27:51 2021 +0800

Output vextract{i,f}{32x4,64x2} for (vec_select:(reg:Vmode) idx) when
byte_offset of idx % 16 == 0.

2020-09-13  Hongtao Liu  
Peter Cordes  
gcc/ChangeLog:

PR target/91103
* config/i386/sse.md (extract_suf): Add V8SF/V8SI/V4DF/V4DI.
(*vec_extract_valign): Output
vextract{i,f}{32x4,64x2} instruction when byte_offset % 16 == 0.

gcc/testsuite/ChangeLog:

PR target/91103
* gcc.target/i386/pr91103-1.c: Add extract tests.
* gcc.target/i386/pr91103-2.c: Ditto.
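
As a hedged illustration of the case this commit targets (my own example using
GNU C vector indexing, not the actual testcase; exact register choices are up
to the compiler):

typedef int v16si __attribute__ ((vector_size (64)));

int f_v16si_4 (v16si v)
{
  /* Element 4 starts at byte offset 16 and 16 % 16 == 0, so the extract can
     be a single vextracti32x4 $1, %zmm0, %xmm0 followed by vmovd, instead of
     a full-width valignd rotation.  */
  return v[4];
}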

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-12 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #10 from Hongtao.liu  ---
(In reply to Peter Cordes from comment #9)
> Thanks for implementing my idea :)
> 
> (In reply to Hongtao.liu from comment #6)
> > For elements located above 128bits, it seems always better(?) to use
> > valign{d,q}
> 
> TL:DR:
>  I think we should still use vextracti* / vextractf* when that can get the
> job done in a single instruction, especially when the VEX-encoded
> vextracti/f128 can save a byte of code size for v[4].
> 
> Extracts are simpler shuffles that might have better throughput on some
> future CPUs, especially the upcoming Zen4, so even without code-size savings
> we should use them when possible.  Tiger Lake has a 256-bit shuffle unit on
> port 1 that supports some common shuffles (like vpshufb); a future Intel
> might add 256->128-bit extracts to that.
> 
> It might also save a tiny bit of power, allowing on-average higher turbo
> clocks.
> 
> ---
> 
> On current CPUs with AVX-512, valignd is about equal to a single vextract,
Yes, they're equal, but considering the comments below, I think it's reasonable
to use vextract instead of valign when byte_offset % 16 == 0.

> and better than multiple instructions.  It doesn't really have downsides on
> current Intel, since I think Intel has continued to not have int/FP bypass
> delays for shuffles.
> 
> We don't know yet what AMD's Zen4 implementation of AVX-512 will look like. 
> If it handles AVX-512 the way Zen1 handled AVX2 (i.e. if it decodes 512-bit
> instructions other than insert/extract into at least 2x 256-bit uops), a
> lane-crossing shuffle like valignd probably costs more than 2 uops.  (vpermq
> is more than 2 uops on Piledriver/Zen1.)  But a 128-bit extract will probably
> cost just one uop.
> (And especially an extract of the high 256 might be very cheap and low
> latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for
> v[8].)
> 
> So this change is good, but using a vextracti64x2 or vextracti64x4 could be
> a useful peephole optimization when byte_offset % 16 == 0.  Or of course
> vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
> with an EVEX-encoded instruction).
> 
> vextractf-whatever allows an FP shuffle on FP data in case some future CPU
> cares about that for shuffles.
> 
> An extract is a simpler shuffle that might have better throughput on some
> future CPU even with full-width execution units.  Some future Intel CPU
> might add support for vextract uops to the extra shuffle unit on port 1. 
> (Which is available when no 512-bit uops are in flight.)  Currently (Ice
> Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm,
> but not including any vextract or valign.  Of course port 1 vector ALUs are
> shut down when 512-bit uops are in flight, but could be relevant for __m256
> vectors on these hypothetical future CPUs.
> 
> When we can get the job done with a single vextract-something, we should use
> that instead of valignd.  Otherwise use valignd.
> 
> We already check the index for low-128 special cases to use vunpckhqdq vs.
> vpshufd (or vpsrldq) or similar FP shuffles.
> 
> -
> 
> On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
> zero), an extract that only writes a 128-bit register will keep them clean
> (even if it reads a ZMM), not needing a VZEROUPPER.  VZEROUPPER is
> only needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function
> like
> 
> float foo(float *p) {
>   some vector stuff that can use high zmm regs;
>   return scalar that happens to be from the middle of a vector;
> }
> 
> could vextract into XMM0, but would need vzeroupper if it used valignd into
> ZMM0.
> 
> (Also related
> https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-
> for-turbo-clocks-to-recover-after-a-512-bit-instruc re reading a ZMM at all
> and turbo clock).
> 
> ---
> 
> Having known zeros outside the low 128 bits (from writing an xmm instead of
> rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
> elements that might be subnormal could happen to be an advantage, maybe
> saving an FP assist for denormal.  We're unlikely to be able to take
> advantage of it to save instructions/uops (like OR instead of blend).  But
> it's not worse to use a single extract instruction instead of a single
> valignd.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #9 from Peter Cordes  ---
Thanks for implementing my idea :)

(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}

TL:DR:
 I think we should still use vextracti* / vextractf* when that can get the job
done in a single instruction, especially when the VEX-encoded vextracti/f128
can save a byte of code size for v[4].

Extracts are simpler shuffles that might have better throughput on some future
CPUs, especially the upcoming Zen4, so even without code-size savings we should
use them when possible.  Tiger Lake has a 256-bit shuffle unit on port 1 that
supports some common shuffles (like vpshufb); a future Intel might add
256->128-bit extracts to that.

It might also save a tiny bit of power, allowing on-average higher turbo
clocks.

---

On current CPUs with AVX-512, valignd is about equal to a single vextract, and
better than multiple instructions.  It doesn't really have downsides on current
Intel, since I think Intel has continued to not have int/FP bypass delays for
shuffles.

We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.  If
it handles AVX-512 the way Zen1 handled AVX2 (i.e. if it decodes 512-bit
instructions other than insert/extract into at least 2x 256-bit uops), a
lane-crossing shuffle like valignd probably costs more than 2 uops.  (vpermq is
more than 2 uops on Piledriver/Zen1.)  But a 128-bit extract will probably cost
just one uop.  (And
especially an extract of the high 256 might be very cheap and low latency, like
vextracti128 on Zen1, so we might prefer vextracti64x4 for v[8].)

So this change is good, but using a vextracti64x2 or vextracti64x4 could be a
useful peephole optimization when byte_offset % 16 == 0.  Or of course
vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
with an EVEX-encoded instruction).

vextractf-whatever allows an FP shuffle on FP data in case some future CPU
cares about that for shuffles.

An extract is a simpler shuffle that might have better throughput on some
future CPU even with full-width execution units.  Some future Intel CPU might
add support for vextract uops to the extra shuffle unit on port 1.  (Which is
available when no 512-bit uops are in flight.)  Currently (Ice Lake / Tiger
Lake) it can only run some common shuffles like vpshufb ymm, but not including
any vextract or valign.  Of course port 1 vector ALUs are shut down when
512-bit uops are in flight, but could be relevant for __m256 vectors on these
hypothetical future CPUs.

When we can get the job done with a single vextract-something, we should use
that instead of valignd.  Otherwise use valignd.

We already check the index for low-128 special cases to use vunpckhqdq vs.
vpshufd (or vpsrldq) or similar FP shuffles.

-

On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
zero), an extract that only writes a 128-bit register will keep them clean
(even if it reads a ZMM), not needing a VZEROUPPER.  VZEROUPPER is only
needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function like

float foo(float *p) {
  some vector stuff that can use high zmm regs;
  return scalar that happens to be from the middle of a vector;
}

could vextract into XMM0, but would need vzeroupper if it used valignd into
ZMM0.
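
A hedged intrinsics sketch of the two approaches for element 4 of a __m512
(function names are mine; the source can't actually force xmm0 vs zmm0, so the
register point above is about the likely ABI outcome):

#include <immintrin.h>

/* vextractf32x4: only a 128-bit destination register is written.  */
float elt4_extract (__m512 v)
{
  return _mm_cvtss_f32 (_mm512_extractf32x4_ps (v, 1));
}

/* valignd: rotates the full ZMM, so a 512-bit register is written before
   the scalar is read from its low element.  */
float elt4_valign (__m512 v)
{
  __m512i r = _mm512_alignr_epi32 (_mm512_castps_si512 (v),
                                   _mm512_castps_si512 (v), 4);
  return _mm512_cvtss_f32 (_mm512_castsi512_ps (r));
}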

(Also related
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
re reading a ZMM at all and turbo clock).

---

Having known zeros outside the low 128 bits (from writing an xmm instead of
rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
elements that might be subnormal could happen to be an advantage, maybe saving
an FP assist for denormal.  We're unlikely to be able to take advantage of it
to save instructions/uops (like OR instead of blend).  But it's not worse to
use a single extract instruction instead of a single valignd.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-08 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #8 from Hongtao.liu  ---
Fixed in GCC12.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-08 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #7 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:60eec23b5eda0f350e572586eee738eab0804a74

commit r12-3425-g60eec23b5eda0f350e572586eee738eab0804a74
Author: liuhongt 
Date:   Wed Sep 8 16:19:37 2021 +0800

Optimize vec_extract for 256/512-bit vector when index exceeds the lower
128 bits.

-   vextracti32x8   $0x1, %zmm0, %ymm0
-   vmovd   %xmm0, %eax
+   valignd $8, %zmm0, %zmm0, %zmm1
+   vmovd   %xmm1, %eax

-   vextracti32x8   $0x1, %zmm0, %ymm0
-   vextracti128$0x1, %ymm0, %xmm0
-   vpextrd $3, %xmm0, %eax
+   valignd $15, %zmm0, %zmm0, %zmm1
+   vmovd   %xmm1, %eax

-   vextractf64x2   $0x1, %ymm0, %xmm0
+   valignq $2, %ymm0, %ymm0, %ymm0

-   vextractf64x4   $0x1, %zmm0, %ymm0
-   vextractf64x2   $0x1, %ymm0, %xmm0
-   vunpckhpd   %xmm0, %xmm0, %xmm0
+   valignq $7, %zmm0, %zmm0, %zmm0

gcc/ChangeLog:

PR target/91103
* config/i386/sse.md
(*vec_extract_valign):
New define_insn.

gcc/testsuite/ChangeLog:

PR target/91103
* gcc.target/i386/pr91103-1.c: New test.
* gcc.target/i386/pr91103-2.c: New test.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-08 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #6 from Hongtao.liu  ---
For elements located above 128bits, it seems always better(?) to use
valign{d,q}

diff --git a/origin.s b/after.s
index 9a7dfee..9a23f7e 100644
--- a/origin.s
+++ b/after.s
@@ -6,7 +6,7 @@
 foo_v8sf_4:
 .LFB0:
.cfi_startproc
-   vextractf128$0x1, %ymm0, %xmm0
+   valignd $4, %ymm0, %ymm0, %ymm0
ret
.cfi_endproc
 .LFE0:
@@ -17,8 +17,7 @@ foo_v8sf_4:
 foo_v8sf_7:
 .LFB1:
.cfi_startproc
-   vextractf128$0x1, %ymm0, %xmm0
-   vshufps $255, %xmm0, %xmm0, %xmm0
+   valignd $7, %ymm0, %ymm0, %ymm0
ret
.cfi_endproc
 .LFE1:
@@ -29,8 +28,8 @@ foo_v8sf_7:
 foo_v8si_4:
 .LFB2:
.cfi_startproc
-   vextracti128$0x1, %ymm0, %xmm0
-   vmovd   %xmm0, %eax
+   valignd $4, %ymm0, %ymm0, %ymm1
+   vmovd   %xmm1, %eax
ret
.cfi_endproc
 .LFE2:
@@ -41,8 +40,8 @@ foo_v8si_4:
 foo_v8si_7:
 .LFB3:
.cfi_startproc
-   vextracti128$0x1, %ymm0, %xmm0
-   vpextrd $3, %xmm0, %eax
+   valignd $7, %ymm0, %ymm0, %ymm1
+   vmovd   %xmm1, %eax
ret
.cfi_endproc
 .LFE3:
@@ -53,7 +52,7 @@ foo_v8si_7:
 foo_v16sf_8:
 .LFB4:
.cfi_startproc
-   vextractf32x8   $0x1, %zmm0, %ymm0
+   valignd $8, %zmm0, %zmm0, %zmm0
ret
.cfi_endproc
 .LFE4:
@@ -64,9 +63,7 @@ foo_v16sf_8:
 foo_v16sf_15:
 .LFB5:
.cfi_startproc
-   vextractf32x8   $0x1, %zmm0, %ymm0
-   vextractf128$0x1, %ymm0, %xmm0
-   vshufps $255, %xmm0, %xmm0, %xmm0
+   valignd $15, %zmm0, %zmm0, %zmm0
ret
.cfi_endproc
 .LFE5:
@@ -77,8 +74,8 @@ foo_v16sf_15:
 foo_v16si_8:
 .LFB6:
.cfi_startproc
-   vextracti32x8   $0x1, %zmm0, %ymm0
-   vmovd   %xmm0, %eax
+   valignd $8, %zmm0, %zmm0, %zmm1
+   vmovd   %xmm1, %eax
ret
.cfi_endproc
 .LFE6:
@@ -89,9 +86,8 @@ foo_v16si_8:
 foo_v16si_15:
 .LFB7:
.cfi_startproc
-   vextracti32x8   $0x1, %zmm0, %ymm0
-   vextracti128$0x1, %ymm0, %xmm0
-   vpextrd $3, %xmm0, %eax
+   valignd $15, %zmm0, %zmm0, %zmm1
+   vmovd   %xmm1, %eax
ret
.cfi_endproc
 .LFE7:
@@ -102,7 +98,7 @@ foo_v16si_15:
 foo_v4df_2:
 .LFB8:
.cfi_startproc
-   vextractf64x2   $0x1, %ymm0, %xmm0
+   valignq $2, %ymm0, %ymm0, %ymm0
ret
.cfi_endproc
 .LFE8:
@@ -113,8 +109,7 @@ foo_v4df_2:
 foo_v4df_3:
 .LFB9:
.cfi_startproc
-   vextractf64x2   $0x1, %ymm0, %xmm0
-   vunpckhpd   %xmm0, %xmm0, %xmm0
+   valignq $3, %ymm0, %ymm0, %ymm0
ret
.cfi_endproc
 .LFE9:
@@ -125,8 +120,8 @@ foo_v4df_3:
 foo_v4di_2:
 .LFB10:
.cfi_startproc
-   vextracti64x2   $0x1, %ymm0, %xmm0
-   vmovq   %xmm0, %rax
+   valignq $2, %ymm0, %ymm0, %ymm1
+   vmovq   %xmm1, %rax
ret
.cfi_endproc
 .LFE10:
@@ -137,8 +132,8 @@ foo_v4di_2:
 foo_v4di_3:
 .LFB11:
.cfi_startproc
-   vextracti64x2   $0x1, %ymm0, %xmm0
-   vpextrq $1, %xmm0, %rax
+   valignq $3, %ymm0, %ymm0, %ymm1
+   vmovq   %xmm1, %rax
ret
.cfi_endproc
 .LFE11:
@@ -149,7 +144,7 @@ foo_v4di_3:
 foo_v8df_4:
 .LFB12:
.cfi_startproc
-   vextractf64x4   $0x1, %zmm0, %ymm0
+   valignq $4, %zmm0, %zmm0, %zmm0
ret
.cfi_endproc
 .LFE12:
@@ -160,9 +155,7 @@ foo_v8df_4:
 foo_v8df_7:
 .LFB13:
.cfi_startproc
-   vextractf64x4   $0x1, %zmm0, %ymm0
-   vextractf64x2   $0x1, %ymm0, %xmm0
-   vunpckhpd   %xmm0, %xmm0, %xmm0
+   valignq $7, %zmm0, %zmm0, %zmm0
ret
.cfi_endproc
 .LFE13:
@@ -173,8 +166,8 @@ foo_v8df_7:
 foo_v8di_4:
 .LFB14:
.cfi_startproc
-   vextracti64x4   $0x1, %zmm0, %ymm0
-   vmovq   %xmm0, %rax
+   valignq $4, %zmm0, %zmm0, %zmm1
+   vmovq   %xmm1, %rax
ret
.cfi_endproc
 .LFE14:
@@ -185,12 +178,11 @@ foo_v8di_4:
 foo_v8di_7:
 .LFB15:
.cfi_startproc
-   vextracti64x4   $0x1, %zmm0, %ymm0
-   vextracti64x2   $0x1, %ymm0, %xmm0
-   vpextrq $1, %xmm0, %rax
+   valignq $7, %zmm0, %zmm0, %zmm1
+   vmovq   %xmm1, %rax
ret
.cfi_endproc
 .LFE15:
.size   foo_v8di_7, .-foo_v8di_7
-   .ident  "GCC: (GNU) 12.0.0 20210907 (experimental)"
+   .ident  "GCC: (GNU) 12.0.0 20210908 (experimental)"
.section.note.GNU-stack,"",@progbits
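
For reference, the functions in this dump presumably come from GNU C vector
indexing along these lines (my reconstruction, not the actual
gcc.target/i386/pr91103-1.c):

typedef float  v8sf  __attribute__ ((vector_size (32)));
typedef int    v8si  __attribute__ ((vector_size (32)));
typedef float  v16sf __attribute__ ((vector_size (64)));
typedef double v8df  __attribute__ ((vector_size (64)));

float  foo_v8sf_4   (v8sf v)  { return v[4];  }
float  foo_v8sf_7   (v8sf v)  { return v[7];  }
int    foo_v8si_7   (v8si v)  { return v[7];  }
float  foo_v16sf_15 (v16sf v) { return v[15]; }
double foo_v8df_7   (v8df v)  { return v[7];  }
/* ...and likewise for the other foo_* functions in the dump.  */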

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Andrew Pinski  changed:

   What|Removed |Added

   Severity|normal  |enhancement

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-07-09
 Ever confirmed|0   |1

--- Comment #5 from Richard Biener  ---
Thanks for the detailed analysis - currently the vectorizer emits N explicit
element extracts and N explicit scalar stores.  That's not a very friendly form
for the targets to generate this kind of mixed code from.  I'm thinking of
somehow generalizing how we represent strided stores, similarly to scatters but
in a way that makes the index vector implicitly specified by an affine
combination (constant * reg).  It probably needs another set of expanders
for that though.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #4 from Peter Cordes  ---
We should not put any stock in what ICC does for GNU C native vector indexing. 
I think it doesn't know how to optimize that because it *always* spills/reloads
even for `vec[0]` which could be a no-op.  And it's always a full-width spill
(ZMM), not just the low XMM/YMM part that contains the desired element.  I
mainly mentioned ICC in my initial post to suggest the store/reload strategy in
general as an *option*.

ICC also doesn't optimize intrinsics: it pretty much always faithfully
transliterates them to asm.  e.g. doing v = _mm_add_epi32(v, _mm_set1_epi32(1));
twice compiles to two separate paddd instructions, instead of one add with a
constant of set1(2).

If we want to see ICC's strided-store strategy, we'd need to write some pure C
that auto-vectorizes.
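
For example, something along these lines would exercise the strided-store
strategy (a hypothetical testcase, not taken from the bug report):

void strided_store (float *restrict dst, const float *restrict src,
                    long stride, long n)
{
  /* One scalar store per element, at a constant runtime stride.  */
  for (long i = 0; i < n; i++)
    dst[i * stride] = src[i] + 1.0f;
}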



That said, store/reload is certainly a valid option when we want all the
elements, and gets *more* attractive with wider vectors, where the one extra
store amortizes over more elements.

Strided stores will typically bottleneck on cache/memory bandwidth unless the
destination lines are already hot in L1d.  But if there's other work in the
loop, we care about OoO exec of that work with the stores, so uop throughput
could be a factor.


If we're tuning for Intel Haswell/Skylake with 1 per clock shuffles but 2 loads
+ 1 store per clock throughput (if we avoid indexed addressing modes for
stores), then it's very attractive and unlikely to be a bottleneck.

There's typically spare load execution-unit cycles in a loop that's also doing
stores + other work.  You need every other uop to be (or include) a load to
bottleneck on that at 4 uops per clock, unless you have indexed stores (which
can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking
a cycle from a load).   Cache-split loads do get replayed to grab the 2nd half,
so it costs extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice Lake will have 2 load + 2 store pipes, and a 2nd shuffle unit.  A
mixed strategy there might be interesting: extract the high 256 bits to memory
with vextractf32x8 and reload it, but shuffle the low 128/256 bits.  That
strategy might be good on earlier CPUs, too.  At least with movss + extractps
stores from the low XMM where we can do that directly.
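
A hedged sketch of that mixed strategy for a stride-s store of 16 floats
(hypothetical helper; the type-punned int store is the usual way to get an
extractps-to-memory from intrinsics):

#include <immintrin.h>

void mixed_strided_store (float *dst, __m512 v, long s)
{
  /* Low XMM: direct movss + extractps stores, no shuffle uops needed.  */
  __m128 lo = _mm512_castps512_ps128 (v);
  _mm_store_ss (dst + 0 * s, lo);
  *(int *) (dst + 1 * s) = _mm_extract_ps (lo, 1);
  *(int *) (dst + 2 * s) = _mm_extract_ps (lo, 2);
  *(int *) (dst + 3 * s) = _mm_extract_ps (lo, 3);
  /* Elements 4..7 would still need shuffles of the low YMM (omitted).  */

  /* High 256 bits: one vextractf32x8 store, then cheap scalar reloads.  */
  float tmp[8];
  _mm256_storeu_ps (tmp, _mm512_extractf32x8_ps (v, 1));
  for (int i = 0; i < 8; i++)
    dst[(8 + i) * s] = tmp[i];
}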

AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one
of which can be a store.  It's definitely worth considering extracting the high
128-bit half of a YMM and using movss then shuffles like vextractps: 2 uops on
Ryzen or AMD.


-

If the stride is small enough (so more than 1 element fits in a vector), we
should consider shuffle + vmaskmovps masked stores, or AVX-512 masked stores
when AVX-512 is available.
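
For instance, with AVX-512 a stride-2 store of 8 floats could be one permute
plus one masked store (a sketch with made-up names, assuming the odd
destination lanes must be left untouched):

#include <immintrin.h>

void store_stride2 (float *dst, __m256 v)
{
  /* Put source element i into destination lane 2*i; odd lanes are don't-care
     because the store mask skips them.  */
  const __m512i perm = _mm512_set_epi32 (0, 7, 0, 6, 0, 5, 0, 4,
                                         0, 3, 0, 2, 0, 1, 0, 0);
  __m512 spread = _mm512_permutexvar_ps (perm, _mm512_castps256_ps512 (v));
  _mm512_mask_storeu_ps (dst, 0x5555, spread);  /* write even lanes only */
}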

But for larger strides, AVX512 scatter may get better in the future.  It's
currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to
surrounding code.  It sustains one per 17 clock throughput, slightly worse than
1 element stored per clock cycle.  Same throughput on KNL, but only 4 uops so
it can overlap much better with surrounding code.
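
For comparison, the scatter form those numbers refer to would look roughly like
this (hypothetical helper):

#include <immintrin.h>

/* Store the 16 floats of v to dst[0], dst[stride], ..., dst[15*stride]
   with a single vscatterdps.  */
void scatter_store (float *dst, __m512 v, int stride)
{
  const __m512i lane = _mm512_set_epi32 (15, 14, 13, 12, 11, 10, 9, 8,
                                         7, 6, 5, 4, 3, 2, 1, 0);
  __m512i idx = _mm512_mullo_epi32 (lane, _mm512_set1_epi32 (stride));
  _mm512_i32scatter_ps (dst, idx, v, 4);  /* scale = sizeof(float) */
}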




For qword elements, we have efficient stores of the high or low half of an XMM.
 A MOVHPS store doesn't need a shuffle uop on most Intel CPUs.  So we only need
1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an
XMM register.
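
A hedged sketch for the YMM case with a runtime stride (hypothetical helper;
the vmovlps/vmovhps-style stores come from _mm_storel_pd/_mm_storeh_pd here):

#include <immintrin.h>

void store_4_doubles (double *dst, __m256d v, long stride)
{
  __m128d lo = _mm256_castpd256_pd128 (v);    /* no instruction */
  __m128d hi = _mm256_extractf128_pd (v, 1);  /* the single shuffle */
  _mm_storel_pd (dst + 0 * stride, lo);
  _mm_storeh_pd (dst + 1 * stride, lo);
  _mm_storel_pd (dst + 2 * stride, hi);
  _mm_storeh_pd (dst + 3 * stride, hi);
}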

Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store.  But Ryzen has
shuffle EUs on multiple ports.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Jakub Jelinek  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com,
   ||jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek  ---
For the constant vector element extraction, it can be done say with:
--- gcc/config/i386/sse.md.jj   2019-07-06 23:55:51.617641994 +0200
+++ gcc/config/i386/sse.md  2019-07-08 12:23:13.315509840 +0200
@@ -9351,7 +9351,7 @@ (define_insn "avx512f_sgetexp")])

-(define_insn "_align"
+(define_insn "_align"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
 (unspec:VI48_AVX512VL [(match_operand:VI48_AVX512VL 1
"register_operand" "v")
   (match_operand:VI48_AVX512VL 2
"nonimmediate_operand" "vm")
--- gcc/config/i386/i386-expand.c.jj2019-07-04 00:18:37.067010375 +0200
+++ gcc/config/i386/i386-expand.c   2019-07-08 12:37:24.687562956 +0200
@@ -14827,6 +14827,14 @@ ix86_expand_vector_extract (bool mmx_ok,
   break;

 case E_V16SFmode:
+  if (elt > 12)
+   {
+ tmp = gen_reg_rtx (V16SImode);
+ vec = gen_lowpart (V16SImode, vec);
+ emit_insn (gen_avx512f_alignv16si (tmp, vec, vec, GEN_INT (elt)));
+ vec = gen_lowpart (V16SFmode, tmp);
+ elt = 0;
+   }
   tmp = gen_reg_rtx (V8SFmode);
   if (elt < 8)
emit_insn (gen_vec_extract_lo_v16sf (tmp, vec));
@@ -14836,6 +14844,14 @@ ix86_expand_vector_extract (bool mmx_ok,
   return;

 case E_V8DFmode:
+  if (elt >= 6)
+   {
+ tmp = gen_reg_rtx (V8DImode);
+ vec = gen_lowpart (V8DImode, vec);
+ emit_insn (gen_avx512f_alignv8di (tmp, vec, vec, GEN_INT (elt)));
+ vec = gen_lowpart (V8DFmode, tmp);
+ elt = 0;
+   }
   tmp = gen_reg_rtx (V4DFmode);
   if (elt < 4)
emit_insn (gen_vec_extract_lo_v8df (tmp, vec));
@@ -14845,6 +14861,13 @@ ix86_expand_vector_extract (bool mmx_ok,
   return;

 case E_V16SImode:
+  if (elt > 12)
+   {
+ tmp = gen_reg_rtx (V16SImode);
+ emit_insn (gen_avx512f_alignv16si (tmp, vec, vec, GEN_INT (elt)));
+ vec = tmp;
+ elt = 0;
+   }
   tmp = gen_reg_rtx (V8SImode);
   if (elt < 8)
emit_insn (gen_vec_extract_lo_v16si (tmp, vec));
@@ -14854,6 +14877,13 @@ ix86_expand_vector_extract (bool mmx_ok,
   return;

 case E_V8DImode:
+  if (elt >= 6)
+   {
+ tmp = gen_reg_rtx (V8DImode);
+ emit_insn (gen_avx512f_alignv8di (tmp, vec, vec, GEN_INT (elt)));
+ vec = tmp;
+ elt = 0;
+   }
   tmp = gen_reg_rtx (V4DImode);
   if (elt < 4)
emit_insn (gen_vec_extract_lo_v8di (tmp, vec));

The question is in which cases it is beneficial.  From a pure -Os POV,
valignd/valignq is one instruction, and for integer extractions it needs a
vmovd afterwards; so for 64-bit extraction it might also be useful for double
[3] and [5] (for long long it is two insns in both cases), and for 32-bit
extraction it is likely also shorter for float [5], [6], [7], [9], [10], [11],
[12], but not for int.
But I admit I have no idea how fast any of this is.
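
A hedged example of the extracts this expander change targets, again using
GNU C vector indexing (expected code per the patch; actual output may differ):

typedef int v16si __attribute__ ((vector_size (64)));
typedef double v8df __attribute__ ((vector_size (64)));

int    f_v16si_14 (v16si v) { return v[14]; } /* elt > 12: valignd $14 + vmovd */
double f_v8df_6   (v8df v)  { return v[6];  } /* elt >= 6: valignq $6, result already in the low element */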

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #2 from Richard Biener  ---
(In reply to Richard Biener from comment #1)
> So when the vectorizer has the need to use strided stores it would be
> cheapest
> to spill the vector and do N element loads and stores?  I guess we can easily
> get bottle-necked by the load/store op bandwidth here?  That is, the
> vectorizer needs
> 
>   for (lane)
> dest[stride * lane] = vector[lane];
> 
> thus store a specific (constant) lane of a vector to memory, for each
> vector lane.  (we could use a scatter store here but only AVX512 has that
> and building the index vector could be tricky and not supported for all
> element types)

Indeed ICC seems to spill for AVX and AVX512 for

typedef int vsi __attribute__((vector_size(SIZE)));
void foo (vsi v, int *p, int *o)
{
  for (int i = 0; i < sizeof(vsi)/4; ++i)
p[o[i]] = v[i];
}

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
So when the vectorizer has the need to use strided stores it would be cheapest
to spill the vector and do N element loads and stores?  I guess we can easily
get bottle-necked by the load/store op bandwidth here?  That is, the
vectorizer needs

  for (lane)
dest[stride * lane] = vector[lane];

thus store a specific (constant) lane of a vector to memory, for each
vector lane.  (we could use a scatter store here but only AVX512 has that
and building the index vector could be tricky and not supported for all
element types)