[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2016-12-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Uroš Bizjak  ---
(In reply to Allan Jensen from comment #11)
> I think this one could probably be closed though.

Fixed.

[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2016-12-10 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #11 from Allan Jensen  ---
I think the issue I noted is completely separate from this one, so I opened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 to deal with it.

I think this one could probably be closed though.

[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2016-12-10 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #10 from Allan Jensen  ---
No, I mean it triggers when you compile with -mavx2; it is solved with
-march=haswell. It appears the issue is that the tune flag
X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL is set for all processors that support
AVX2, but if you use generic+avx2, GCC still pessimistically optimizes for
pre-AVX2 processors and sets MASK_AVX256_SPLIT_UNALIGNED_LOAD.

There are two controlling flags, though, and since the second one,
X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, is still not set for some AVX2
processors (btver and znver) besides generic, it is harder to argue what
generic+avx2 should do there.
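
For what it's worth, the split behaviour discussed above can also be toggled
directly on the command line; a rough sketch, assuming the option names from
the GCC manual (-mavx256-split-unaligned-load/-store correspond to the masks
named above, and the exact defaults picked by generic tuning depend on the GCC
version):

/* t.c: any unaligned 256-bit load feeding an arithmetic op, e.g. the
   intrinsics snippet from comment #7 below.

   gcc -O2 -mavx2 -S t.c
     generic tuning: the unaligned 256-bit loads/stores may be split
     into xmm halves plus vinserti128/vextracti128.

   gcc -O2 -mavx2 -mno-avx256-split-unaligned-load \
                  -mno-avx256-split-unaligned-store -S t.c
     splitting disabled explicitly: single vmovdqu loads and stores.

   gcc -O2 -march=haswell -S t.c
     haswell tuning treats unaligned 256-bit accesses as optimal, so no
     splitting is done.  */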

[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2016-12-10 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #9 from Marc Glisse  ---
(In reply to Allan Jensen from comment #7)
> This is significantly worse with integer operands.
> 
> _mm256_storeu_si256((__m256i *)&data[3],
> _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
>  _mm256_loadu_si256((const __m256i *)&data[1]))
> );

Please don't post isolated lines of code; always post complete examples ready
to be copy-pasted and compiled. The declaration of data is relevant to the
generated code.

> compiles to:
> 
> vmovdqu 0x20(%rax),%xmm0
> vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0
> vmovdqu (%rax),%xmm1
> vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1
> vpaddd %ymm1,%ymm0,%ymm0
> vmovups %xmm0,0x60(%rax)
> vextracti128 $0x1,%ymm0,0x70(%rax)

With trunk and -march=skylake (or haswell), I can get

vmovdqu data(%rip), %ymm0
vpaddd  data+32(%rip), %ymm0, %ymm0
vmovdqu %ymm0, data+96(%rip)

so this looks fixed?
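
For reference, a complete version of the snippet, ready to be copy-pasted and
compiled (a minimal sketch: the declaration of data is an assumption, chosen as
__m256i data[8] so that &data[1] and &data[3] correspond to the +32 and +96
byte offsets in the output above; built e.g. with gcc -O2 -mavx2 -S):

#include <immintrin.h>

__m256i data[8];

void
add_unaligned (void)
{
  _mm256_storeu_si256 ((__m256i *) &data[3],
                       _mm256_add_epi32 (_mm256_loadu_si256 ((const __m256i *) &data[0]),
                                         _mm256_loadu_si256 ((const __m256i *) &data[1])));
}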

[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2016-12-10 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #8 from Allan Jensen  ---
Note this happens with -mavx2, but not with -march=haswell. It appears the
tuning is a bit too pessimistic when avx2 is enabled on generic x64.

[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2016-12-10 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Allan Jensen  changed:

   What|Removed |Added

 CC||linux at carewolf dot com

--- Comment #7 from Allan Jensen  ---
This is significantly worse with integer operands.

_mm256_storeu_si256((__m256i *)&data[3],
_mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
 _mm256_loadu_si256((const __m256i *)&data[1]))
);

compiles to:

vmovdqu 0x20(%rax),%xmm0
vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0
vmovdqu (%rax),%xmm1
vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1
vpaddd %ymm1,%ymm0,%ymm0
vmovups %xmm0,0x60(%rax)
vextracti128 $0x1,%ymm0,0x70(%rax)

[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2013-10-30 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #6 from Jakub Jelinek  ---
Author: jakub
Date: Wed Oct 30 17:59:44 2013
New Revision: 204219

URL: http://gcc.gnu.org/viewcvs?rev=204219&root=gcc&view=rev
Log:
PR target/47754
* config/i386/i386.c (ix86_avx256_split_vector_move_misalign): If
op1 is misaligned_operand, just use *mov_internal insn
rather than UNSPEC_LOADU load.
(ix86_expand_vector_move_misalign): Likewise (for TARGET_AVX only).
Avoid gen_lowpart on op0 if it isn't MEM.

* gcc.target/i386/avx256-unaligned-load-1.c: Adjust scan-assembler
and scan-assembler-not regexps.
* gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
* gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
* gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
* gcc.target/i386/l_fma_float_1.c: Use pattern for
scan-assembler-times instead of just one insn name.
* gcc.target/i386/l_fma_float_2.c: Likewise.
* gcc.target/i386/l_fma_float_3.c: Likewise.
* gcc.target/i386/l_fma_float_4.c: Likewise.
* gcc.target/i386/l_fma_float_5.c: Likewise.
* gcc.target/i386/l_fma_float_6.c: Likewise.
* gcc.target/i386/l_fma_double_1.c: Likewise.
* gcc.target/i386/l_fma_double_2.c: Likewise.
* gcc.target/i386/l_fma_double_3.c: Likewise.
* gcc.target/i386/l_fma_double_4.c: Likewise.
* gcc.target/i386/l_fma_double_5.c: Likewise.
* gcc.target/i386/l_fma_double_6.c: Likewise.

Modified:
trunk/gcc/config/i386/i386.c
trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-1.c
trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-2.c
trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-3.c
trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-4.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_double_1.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_double_2.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_double_3.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_double_4.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_double_5.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_double_6.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_float_1.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_float_2.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_float_3.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_float_4.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_float_5.c
trunk/gcc/testsuite/gcc.target/i386/l_fma_float_6.c
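
For context, the adjusted tests are GCC target testsuite files driven by
DejaGnu directives; they have roughly this shape (an illustrative sketch only,
not the actual contents of the files listed above; the real dg-options and
scan regexps differ per test):

/* { dg-do compile } */
/* { dg-options "-O3 -mavx -mavx256-split-unaligned-load" } */

#define N 1024
float a[N], b[N + 3], c[N];

void
avx_test (void)
{
  int i;
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i + 3];
}

/* The scan-assembler / scan-assembler-not patterns are the part adjusted by
   r204219: the unaligned load may now be folded into the arithmetic
   instruction instead of appearing as a separate UNSPEC-based load.  */
/* { dg-final { scan-assembler "vaddps" } } */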


[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2012-02-22 Thread xiaoyuanbo at yeah dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

xiaoyuanbo  changed:

   What|Removed |Added

 CC||xiaoyuanbo at yeah dot net

--- Comment #5 from xiaoyuanbo  2012-02-22 13:04:03 
UTC ---
so you are boss


[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2011-02-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Richard Guenther  changed:

   What|Removed |Added

 CC||rth at gcc dot gnu.org

--- Comment #4 from Richard Guenther  2011-02-16 
10:49:30 UTC ---
Note that GCC doesn't use unaligned memory operands because the knowledge that
this is OK for AVX isn't implemented; it simply treats the AVX case the same as
the SSE case, where the memory operands are required to be aligned.  That said,
unaligned SSE and AVX moves are implemented using UNSPECs, so they will never
be combined with other instructions.  I don't know if there is a way to still
distinguish unaligned and aligned loads/stores and at the same time let them
appear as regular RTL moves.

Richard, is that even possible?
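
Concretely, the missed combination is the difference between the two code
shapes quoted below in comment #3: a separate vmovups feeding vaddps versus
the unaligned load folded into vaddps as a memory operand.  A minimal sketch
in intrinsics (the array declarations are assumptions, only there to make it
self-contained):

#include <immintrin.h>

float in[32], out[8];

void
add_unaligned (void)
{
  /* Two unaligned 256-bit loads feeding an add.  While the unaligned load is
     represented as an UNSPEC, each load stays a separate vmovups (first
     listing in comment #3); represented as a plain move, one of the loads
     could be folded into vaddps as a memory operand (second listing).  */
  __m256 a = _mm256_loadu_ps (&in[0]);
  __m256 b = _mm256_loadu_ps (&in[1]);
  _mm256_storeu_ps (out, _mm256_add_ps (a, b));
}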


[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2011-02-15 Thread kretz at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #3 from Matthias Kretz  2011-02-15 16:40:38 
UTC ---
ICC? Whatever, I stopped trusting that compiler long ago:
:
vmovups 0x2039b8(%rip),%xmm0
vmovups 0x2039b4(%rip),%xmm1
vinsertf128 $0x1,0x2039b6(%rip),%ymm0,%ymm2
vinsertf128 $0x1,0x2039b0(%rip),%ymm1,%ymm3
vaddps %ymm3,%ymm2,%ymm4
vmovups %ymm4,0x20399c(%rip)
vzeroupper
retq

:
vmovups 0x203978(%rip),%ymm0
vaddps 0x203974(%rip),%ymm0,%ymm1
vmovups %ymm1,0x203974(%rip)
vzeroupper
retq

Nice optimization of unaligned loads there... not.


Just a small side note for your enjoyment: I wrote a C++ abstraction for SSE,
and with GCC it gives an almost four-fold speedup for Mandelbrot. ICC, on the
other hand, compiles such awful code that, even with SSE in use, it rather
produces a four-fold slowdown compared to the non-SSE code.

GCC really is a nice compiler! Keep on rocking!


[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2011-02-15 Thread kretz at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #2 from Matthias Kretz  2011-02-15 16:31:39 
UTC ---
True, the Optimization Reference Manual and the AVX docs are not very specific
about the performance impact of this. But as far as I understood the docs, it
will internally not be slower than an unaligned load + op, but also not faster,
except, of course, where memory fetch latency is involved. So it's just about
having more registers available (again, AFAIU).

If you want I can try the same testcase on ICC...


[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand

2011-02-15 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Richard Guenther  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-*
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2011.02.15 16:21:49
 Ever Confirmed|0   |1

--- Comment #1 from Richard Guenther  2011-02-15 
16:21:49 UTC ---
Confirmed.  I'm not sure whether it really wouldn't be slower for a
non-load/store instruction to need assistance for unaligned loads/stores.