http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

           Summary: [missed optimization] AVX allows unaligned memory
                    operands but GCC uses unaligned load and register
                    operand
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: kr...@kde.org


According to the AVX docs: "With the exception of explicitly aligned 16 or 32
byte SIMD load/store instructions, most VEX-encoded, arithmetic and data
processing instructions operate in a flexible environment regarding memory
address alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load
semantics will support unaligned load operation by default. Memory arguments
for most instructions with VEX prefix operate normally without causing #GP(0)
on any byte-granularity alignment (unlike Legacy SSE instructions)."
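
In other words, a VEX-encoded arithmetic instruction such as vaddps can take
an unaligned memory operand directly; only the explicitly aligned load/store
forms (e.g. vmovaps) fault on misalignment. A minimal sketch of what this
permits (function and parameter names are mine, not from the docs):

#include <immintrin.h>

/* With AVX, the memory operand of vaddps may have any alignment, so a
   compiler is free to fold the unaligned load below into the addition
   itself, i.e. vaddps (%rdi),%ymm0,%ymm0, instead of emitting a separate
   vmovups followed by a register-register vaddps. */
__m256 add_from_unaligned(const float *p, __m256 v)
{
    return _mm256_add_ps(v, _mm256_loadu_ps(p));
}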

I tested whether GCC would take advantage of this, and found that it doesn't:

_mm256_store_ps(&data[3],
  _mm256_add_ps(_mm256_load_ps(&data[0]), _mm256_load_ps(&data[1]))
);
compiles to:
vmovaps 0x200b18(%rip),%ymm0
vaddps 0x200b13(%rip),%ymm0,%ymm0
vmovaps %ymm0,0x200b10(%rip)

whereas

_mm256_store_ps(&data[3],
  _mm256_add_ps(_mm256_loadu_ps(&data[0]), _mm256_loadu_ps(&data[1]))
);
compiles to:
vmovups 0x200b4c(%rip),%ymm0
vmovups 0x200b40(%rip),%ymm1
vaddps %ymm0,%ymm1,%ymm0
vmovaps %ymm0,0x200b3c(%rip)

GCC could instead fold one of the unaligned loads into a memory operand of the
vaddps here, as in the sketch below. According to the AVX docs this doesn't
hurt performance, and AFAIU it saves an instruction and reduces register
pressure, since %ymm1 is no longer needed.
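
For reference, a self-contained version of the test (the array size, alignment
attribute, function wrapper, and flags are my additions; compiled with
something like gcc -O2 -mavx). Note that, as in the report, the store uses the
aligned intrinsic on an unaligned address, so this is a codegen test rather
than something to execute:

#include <immintrin.h>

/* 32-byte aligned, so &data[0] is an aligned address while &data[1]
   and &data[3] are deliberately misaligned. */
float data[16] __attribute__((aligned(32)));

void test_unaligned(void)
{
    /* GCC emits vmovups + vmovups + register-register vaddps + vmovaps.
       The hoped-for output folds one load into the add, e.g. (sketch,
       offsets illustrative):
         vmovups data(%rip),%ymm0
         vaddps  data+4(%rip),%ymm0,%ymm0
         vmovaps %ymm0,data+12(%rip)   */
    _mm256_store_ps(&data[3],
        _mm256_add_ps(_mm256_loadu_ps(&data[0]),
                      _mm256_loadu_ps(&data[1])));
}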

Would be nice to have.
