> which compiles to a single shufps instruction.

Doesn't it often require additional needless movaps instructions?
For example, the following:

  asm
  {
    movaps XMM0, a;
    movaps XMM1, b;
    addps  XMM0, XMM1;
    movaps a, XMM0;
  }
  asm
  {
    movaps XMM0, a;
    movaps XMM1, b;
    addps  XMM0, XMM1;
    movaps a, XMM0;
  }

compiles to

  movaps -0x48(%rsp),%xmm0
  movaps -0x38(%rsp),%xmm1
  addps  %xmm1,%xmm0
  movaps %xmm0,-0x48(%rsp)
  movaps -0x48(%rsp),%xmm0
  movaps -0x38(%rsp),%xmm1
  addps  %xmm1,%xmm0
  movaps %xmm0,-0x48(%rsp)

Is it possible to avoid the needless loading and storing of values when
calling multiple functions that use asm blocks? It also seems that the
compiler doesn't inline functions containing asm.
