Vic,

On 6/10/2011 4:16 AM, Martin Fleisz wrote:
I am not quite sure how those _mm_* functions work internally, but if
they are really functions, that will definitely hurt performance. I
think using the SSE2 assembly instructions directly (like paddw) should
be much better.

Vic

The _mm_* functions are compiler intrinsics and map 1:1 to the
corresponding SSE instructions. It's just a nicer and cleaner interface
to the instruction set (and there is no function call overhead).

-Martin


Martin beat me to it...

The _mm_* functions _do_ indeed get compiled down to SSE assembly instructions.

Here is what the function compiles down to:

rfx_decode_YCbCr_to_RGB_SSE2():
  b0:    55                       push   %ebp
  b1:    89 e5                    mov    %esp,%ebp
  b3:    8b 45 08                 mov    0x8(%ebp),%eax
  b6:    8b 4d 0c                 mov    0xc(%ebp),%ecx
  b9:    8b 55 10                 mov    0x10(%ebp),%edx
  bc:    53                       push   %ebx
  bd:    8d 98 00 20 00 00        lea    0x2000(%eax),%ebx
  c3:    90                       nop
  c4:    8d 74 26 00              lea    0x0(%esi,%eiz,1),%esi
  c8:    66 0f 6f 1d 70 00 00     movdqa 0x70,%xmm3
  cf:    00
            cc: R_386_32    .rodata.cst16
  d0:    66 0f fd 18              paddw  (%eax),%xmm3
  d4:    66 0f 6f 12              movdqa (%edx),%xmm2
  d8:    66 0f 6f e2              movdqa %xmm2,%xmm4
  dc:    66 0f 71 e4 02           psraw  $0x2,%xmm4
  e1:    66 0f 6f f2              movdqa %xmm2,%xmm6
  e5:    66 0f 71 e6 03           psraw  $0x3,%xmm6
  ea:    66 0f 6f ea              movdqa %xmm2,%xmm5
  ee:    66 0f 71 e5 05           psraw  $0x5,%xmm5
  f3:    66 0f 6f cb              movdqa %xmm3,%xmm1
  f7:    66 0f fd ca              paddw  %xmm2,%xmm1
  fb:    66 0f 6f 01              movdqa (%ecx),%xmm0
  ff:    66 0f fd cc              paddw  %xmm4,%xmm1
 103:    66 0f ef e4              pxor   %xmm4,%xmm4
 107:    66 0f fd ce              paddw  %xmm6,%xmm1
 10b:    66 0f fd cd              paddw  %xmm5,%xmm1
 10f:    66 0f ee cc              pmaxsw %xmm4,%xmm1
 113:    66 0f 6f e0              movdqa %xmm0,%xmm4
 117:    66 0f 71 e4 02           psraw  $0x2,%xmm4
 11c:    66 0f ea 0d 60 00 00     pminsw 0x60,%xmm1
 123:    00
            120: R_386_32    .rodata.cst16
 124:    66 0f e7 08              movntdq %xmm1,(%eax)
 128:    66 0f 6f c8              movdqa %xmm0,%xmm1
 12c:    66 0f 6f c3              movdqa %xmm3,%xmm0
 130:    66 0f 6f f9              movdqa %xmm1,%xmm7
 134:    66 0f f9 c4              psubw  %xmm4,%xmm0
 138:    66 0f 71 e7 04           psraw  $0x4,%xmm7
 13d:    66 0f f9 c7              psubw  %xmm7,%xmm0
 141:    66 0f 6f f9              movdqa %xmm1,%xmm7
 145:    66 0f 71 e7 05           psraw  $0x5,%xmm7
 14a:    66 0f f9 c7              psubw  %xmm7,%xmm0
 14e:    66 0f 6f fa              movdqa %xmm2,%xmm7
 152:    66 0f 71 e7 01           psraw  $0x1,%xmm7
 157:    66 0f f9 c7              psubw  %xmm7,%xmm0
 15b:    66 0f 71 e2 04           psraw  $0x4,%xmm2
 160:    66 0f f9 c6              psubw  %xmm6,%xmm0
 164:    83 c0 10                 add    $0x10,%eax
 167:    66 0f f9 c2              psubw  %xmm2,%xmm0
 16b:    66 0f ef d2              pxor   %xmm2,%xmm2
 16f:    66 0f f9 c5              psubw  %xmm5,%xmm0
 173:    66 0f ee c2              pmaxsw %xmm2,%xmm0
 177:    66 0f 6f d1              movdqa %xmm1,%xmm2
 17b:    66 0f 71 e2 01           psraw  $0x1,%xmm2
 180:    66 0f ea 05 60 00 00     pminsw 0x60,%xmm0
 187:    00
            184: R_386_32    .rodata.cst16
 188:    66 0f e7 01              movntdq %xmm0,(%ecx)
 18c:    66 0f 6f c3              movdqa %xmm3,%xmm0
 190:    66 0f fd c1              paddw  %xmm1,%xmm0
 194:    66 0f fd c2              paddw  %xmm2,%xmm0
 198:    66 0f 71 e1 06           psraw  $0x6,%xmm1
 19d:    66 0f fd c4              paddw  %xmm4,%xmm0
 1a1:    66 0f ef e4              pxor   %xmm4,%xmm4
 1a5:    83 c1 10                 add    $0x10,%ecx
 1a8:    66 0f fd c1              paddw  %xmm1,%xmm0
 1ac:    66 0f ee c4              pmaxsw %xmm4,%xmm0
 1b0:    66 0f ea 05 60 00 00     pminsw 0x60,%xmm0
 1b7:    00
            1b4: R_386_32    .rodata.cst16
 1b8:    66 0f e7 02              movntdq %xmm0,(%edx)
 1bc:    83 c2 10                 add    $0x10,%edx
 1bf:    39 d8                    cmp    %ebx,%eax
 1c1:    0f 85 01 ff ff ff        jne    c8 <rfx_decode_YCbCr_to_RGB_SSE2+0x18>
 1c7:    5b                       pop    %ebx
 1c8:    5d                       pop    %ebp
 1c9:    c3                       ret
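
For anyone who wants to check the mapping on something smaller, here is a
minimal sketch. It is not the FreeRDP code: the function names, the shift
amount and the 0..255 clamp range are assumptions picked for illustration.
Each intrinsic is annotated with the instruction a compiler would typically
emit for it; the exact register allocation and load/store forms will vary
with compiler and flags.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: dst[i] = clamp(a[i] + (b[i] >> 2), 0, 255)
 * for eight signed 16-bit lanes. Unaligned loads and stores are used
 * so the caller does not have to align the buffers. */
void add_shift_clamp_epi16(const int16_t *a, const int16_t *b, int16_t *dst)
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);   /* movdqu */
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);   /* movdqu */
    __m128i sum = _mm_add_epi16(va, _mm_srai_epi16(vb, 2)); /* psraw, paddw */

    sum = _mm_max_epi16(sum, _mm_setzero_si128());       /* pxor, pmaxsw */
    sum = _mm_min_epi16(sum, _mm_set1_epi16(255));       /* pminsw */

    _mm_storeu_si128((__m128i *)dst, sum);               /* movdqu */
}

/* Small self-check with hand-computed values. */
void add_shift_clamp_selfcheck(void)
{
    const int16_t a[8]      = { 0, 100, 200, 300, -50, 255,  1, -1 };
    const int16_t b[8]      = { 0, 100, 400,   0,  20,   4, -4, -1 };
    const int16_t expect[8] = { 0, 125, 255, 255,   0, 255,  0,  0 };
    int16_t out[8];
    int i;

    add_shift_clamp_epi16(a, b, out);
    for (i = 0; i < 8; i++)
        assert(out[i] == expect[i]);
}
```

Building this with something like "gcc -O2 -msse2 -c" and running
"objdump -d" on the object file should show the same paddw / psraw /
pmaxsw / pminsw pattern as in the listing above, with no call instructions.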


Thanks,
 Steve
_______________________________________________
Freerdp-devel mailing list
Freerdp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freerdp-devel
