On Wed, Oct 30, 2013 at 10:47:13AM +0100, Jakub Jelinek wrote:
> Hi!
> 
> Yesterday I noticed that for AVX, which allows unaligned operands in
> arithmetic instructions, we still don't combine unaligned loads with the
> AVX arithmetic instructions.  So, say, for -O2 -mavx -ftree-vectorize
> void
> f1 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i] * 7;
> }
> 
> void
> f2 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i];
> }
> we have:
>         vmovdqu (%rsi,%rax), %xmm0
>         vpmulld %xmm1, %xmm0, %xmm0
>         vmovups %xmm0, (%rdi,%rax)
> in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> *mov<mode>_internal patterns (and various others) use misaligned_operand
> to see if they should emit vmovaps or vmovups (etc.), so as suggested by [...]
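
For reference, the folded form being asked for above would look something
like this (register choice here is only illustrative), since VEX-encoded
arithmetic instructions accept unaligned memory operands:

        vpmulld (%rsi,%rax), %xmm1, %xmm0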

Not combining the loads is intentional.  On pre-Haswell architectures,
splitting an unaligned 32-byte load into two 16-byte halves is faster
than a single 32-byte unaligned load.
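
As a rough sketch, the split form of a 32-byte unaligned load and store
looks like the following (registers and addressing are illustrative, not
actual compiler output):

        vmovups (%rsi,%rax), %xmm0
        vinsertf128     $1, 16(%rsi,%rax), %ymm0, %ymm0
        ...
        vmovups %xmm0, (%rdi,%rax)
        vextractf128    $1, %ymm0, 16(%rdi,%rax)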

See the Intel® 64 and IA-32 Architectures Optimization Reference Manual
for details.
