On Saturday 23 February 2008 Prakash Punnoor has written:
> On Saturday 23 February 2008 Uros Bizjak has written:
> > Hello!
> >
> > >   f7: 0f 7f 5c 24 f0        movq   %mm3,-0x10(%rsp)
> > >   fc: 0f 7f 54 24 f8        movq   %mm2,-0x8(%rsp)
> > >  101: 48 8b 5c 24 f8        mov    -0x8(%rsp),%rbx
> > >  106: 48 89 5c 38 40        mov    %rbx,0x40(%rax,%rdi,1)
> > >  10b: 48 8b 5c 24 f0        mov    -0x10(%rsp),%rbx
> > >  110: 48 89 5c 38 48        mov    %rbx,0x48(%rax,%rdi,1)
> > >
> > > As you can see, in the intrinsic version gcc moves the mmx register
> > > to the stack, reloads it from the stack and writes it to the
> > > destination. Why?
> > >
> > > I don't know whether earlier gcc 4.2 versions produced such stupid
> > > code. Compiling as 32-bit shows similar stupidity, though gcc
> > > reloads into an mmx register...
> >
> > This is a variant of "Strange code for MMX register moves" [1] or its
> > dupe "mmx and movd/movq on x86_64" [2]. Since touching a %mm register
> > switches the x87 register stack to MMX mode, we penalize mmx moves
> > severely in order to prevent gcc from ever allocating %mm for DImode
> > moves, unless really necessary.
> [...]
>
> Just as a side note: the equivalent SSE2 code looks fine, but I have a
> question regarding the store instruction that is used:
>
> #include <stdint.h>
> #include <emmintrin.h>
>
> void diff_pixels_mmx4(char *block, const uint8_t *s1, const uint8_t *s2,
>                       long stride)
> {
>     long offset = -128;
>     block += 64;
>     __m128i mm7 = _mm_setzero_si128();
>     do {
>         __m128i mm0 = *(__m128i*)s1;
>         __m128i mm2 = *(__m128i*)s2;
>         __m128i mm1 = mm0;
>         __m128i mm3 = mm2;
>         mm0 = _mm_unpacklo_epi8(mm0, mm7);
>         mm1 = _mm_unpackhi_epi8(mm1, mm7);
>         mm2 = _mm_unpacklo_epi8(mm2, mm7);
>         mm3 = _mm_unpackhi_epi8(mm3, mm7);
>         mm0 = _mm_sub_epi16(mm0, mm2);
>         mm1 = _mm_sub_epi16(mm1, mm3);
>         *(__m128i*)(block+offset) = mm0;
>         *(__m128i*)(block+offset+16) = mm1;
>         s1 += stride;
>         s2 += stride;
>         offset += 32;
>     } while (offset < 0);
> }
>
> Generated assembly (-O2 -march=k8):
>
> 0000000000000050 <diff_pixels_mmx4>:
>   50: 66 0f ef e4             pxor   %xmm4,%xmm4
>   54: 48 c7 c0 80 ff ff ff    mov    $0xffffffffffffff80,%rax
>   5b: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   60: 66 0f 6f 0e             movdqa (%rsi),%xmm1
>   64: 48 01 ce                add    %rcx,%rsi
>   67: 66 0f 6f 02             movdqa (%rdx),%xmm0
>   6b: 48 01 ca                add    %rcx,%rdx
>   6e: 66 0f 6f d1             movdqa %xmm1,%xmm2
>   72: 66 0f 6f d8             movdqa %xmm0,%xmm3
>   76: 66 0f 68 cc             punpckhbw %xmm4,%xmm1
>   7a: 66 0f 60 d4             punpcklbw %xmm4,%xmm2
>   7e: 66 0f 60 dc             punpcklbw %xmm4,%xmm3
>   82: 66 0f 68 c4             punpckhbw %xmm4,%xmm0
>   86: 66 0f f9 d3             psubw  %xmm3,%xmm2
>   8a: 0f 29 54 38 40          movaps %xmm2,0x40(%rax,%rdi,1)
>   8f: 66 0f f9 c8             psubw  %xmm0,%xmm1
>   93: 0f 29 4c 38 50          movaps %xmm1,0x50(%rax,%rdi,1)
>   98: 48 83 c0 20             add    $0x20,%rax
>   9c: 75 c2                   jne    60 <diff_pixels_mmx4+0x10>
>   9e: f3 c3                   repz retq
>
> Why is movaps (SSE, floating-point data) used for the store instead of
> movdqa (SSE2, integer data)? Bug or feature? It is used even when
> compiling with -O0.
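Just to spell Uros' point out in code (a minimal sketch of mine, not from
the thread; the function name is made up):

#include <mmintrin.h>

/* %mm0-%mm7 alias the x87 register stack, so the first MMX
 * instruction puts the FPU into MMX mode; x87 code must not run
 * again until emms. This mode switch is why gcc penalizes %mm so
 * heavily for plain DImode moves. */
void mmx_then_x87(__m64 *dst, __m64 a, __m64 b, double *x)
{
    *dst = _mm_add_pi16(a, b);  /* any MMX op enters MMX mode */
    _mm_empty();                /* emms: leave MMX mode again */
    *x *= 2.0;                  /* only safe after emms on targets
                                   that do this multiply on the x87
                                   stack, e.g. 32-bit x86 w/o SSE */
}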
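To look at the store in isolation, the kernel can be cut down to a single
aligned 16-byte integer store (again just a sketch, store_one is my name
for it; _mm_store_si128 is the plain SSE2 aligned store intrinsic):

#include <emmintrin.h>

void store_one(__m128i *dst, __m128i v)
{
    _mm_store_si128(dst, v);  /* one 16-byte aligned store; whether
                                 gcc emits movaps or movdqa turns
                                 out to depend on -march, see below */
}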
Testing further: the -march=k8 flag seems to cause this. Leaving it out,
movdqa is used, so I guess it is a feature: movaps lacks the 0x66
operand-size prefix of movdqa, so the store is one byte shorter, and on K8
a store presumably pays no int/float domain penalty.
-- 
(°=                 =°)
//\ Prakash Punnoor /\\
V_/                 \_V