Re: [FFmpeg-devel] libavcodec/exr : add SSE SIMD for reorder_pixels v2 (WIP)

2017-09-03 Thread Martin Vignali
Hello,

Thanks Ivan for your comments and explanations,

---
> > [...]
> > +;**
> 
> > +
> > +%include "libavutil/x86/x86util.asm"
>
> Still missing explicit x86inc.asm
>

If I include x86inc.asm instead of x86util.asm, I get a linker error (it
seems the prefix of the functions becomes x264 instead of ff).


> > +
> > +shr    sizeq, 1;   sizeq = half_size
> > +mov    r3, sizeq
> > +shr    r3, 4;  r3 = half_size/16 -> loop_simd count
> > +
> > +loop_simd:
> > +;initial condition loop
> > +jle    after_loop_simd;    jump to scalar part if loop_simd count (r3) is 0
> > +
> > +movdqa m0, [srcq];   load first part
> > +movdqu m1, [srcq + sizeq];   load second part
>
> Would you test if moving the movdqu first makes any difference in speed?
> I had a similar case and I think that makes it faster,
> since movdqu has bigger latency.
> Might not matter on newer cpu.
>
> (If you can't tell the difference, leave it as it is.)
>

I didn't notice a speed difference.


For the rest of your comments :

You're right, I can remove the scalar part:
the src and dst buffers seem to be padded to 32 bytes by av_fast_padded_malloc,
so for the SSE version that is enough to avoid overreads and overwrites.
But this will need care for an AVX2 version.


I also modified the loop following your comments.
I offset src and src2 by half_size, and dst by 2*half_size, so I can
remove some add/sub instructions,
and I use half_size * -1 as the offset for src, src2 and dst.

The current asm version is below (still WIP, but it passes fate tests for me).
I still need to check the maximum overread/overwrite for several size values.

%include "libavutil/x86/x86util.asm"

SECTION .text

;--
; void ff_reorder_pixels(uint8_t *src, uint8_t *dst, int size)
;--


INIT_XMM sse2
cglobal reorder_pixels, 3,5,3, src, dst, size

add    dstq, sizeq              ; offset dst by 2*half_size

shr    sizeq, 1                 ; sizeq = half_size
mov    r3, sizeq                ; r3 = half_size

add    srcq, r3                 ; offset src by half_size
mov    r4, srcq                 ; r4 = start of the second part of the buffer
add    r4, r3                   ; offset r4 by half_size

neg    r3                       ; r3 = half_size * -1 (offset for dst, src, src2 (r4))

loop_simd:
;initial loop condition
jge    end

movdqa m0, [srcq + r3]          ; load first part
movdqu m1, [r4 + r3]            ; load second part

punpcklbw m2, m0, m1            ; interleaved part 1
movdqa [dstq + r3*2], m2        ; copy to dst array

punpckhbw m0, m1                ; interleaved part 2
movdqa [dstq + r3*2 + mmsize], m0 ; copy to dst array

add    r3, mmsize
jmp    loop_simd

end:
RET
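For comparison, the same interleave can be modeled with SSE2 intrinsics. This is only a sketch (the function name is hypothetical, not the one from the patch); it assumes size is even and that both halves are multiples of 16 bytes, which the av_fast_padded_malloc padding discussed above makes workable:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Interleave the two halves of src into dst, 16 bytes of each half per
 * iteration: dst[2i] = src[i], dst[2i+1] = src[half+i]. */
static void reorder_pixels_sse2_model(const uint8_t *src, uint8_t *dst,
                                      ptrdiff_t size)
{
    ptrdiff_t half = size >> 1;
    for (ptrdiff_t i = 0; i < half; i += 16) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + half + i));
        /* punpcklbw / punpckhbw equivalents */
        _mm_storeu_si128((__m128i *)(dst + 2 * i),
                         _mm_unpacklo_epi8(lo, hi));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16),
                         _mm_unpackhi_epi8(lo, hi));
    }
}
```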




For the performance, the current state is:

Scalar :
3082024 decicycles in reorder_pixels_zip, 130413 runs, 659 skips
bench: utime=115.926s
bench: maxrss=607670272kB

SSE ASM :
296370 decicycles in reorder_pixels_zip, 130946 runs, 126 skips
bench: utime=101.481s
bench: maxrss=607698944kB

SSE Intrinsics :
289448 decicycles in reorder_pixels_zip, 130944 runs, 128 skips
bench: utime=101.417s
bench: maxrss=607694848kB


After taking a look at the asm code clang generates from the intrinsics
version (at -O2),

it seems clang rewrites the loop_simd part to process twice as many
bytes per iteration
(and adds a condition to handle an odd number of chunks in half_size).

I will run some tests to see if I can get a speed improvement using the
same method.
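That transform could be modeled in intrinsics roughly as follows (a sketch from my reading of the description, not clang's actual output; the name is hypothetical):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Two 16-byte chunks of each half per iteration, plus one extra 16-byte
 * step when the chunk count is odd. Assumes half_size is a multiple of 16. */
static void reorder_pixels_unrolled(const uint8_t *src, uint8_t *dst,
                                    ptrdiff_t size)
{
    ptrdiff_t half = size >> 1, i = 0;
    for (; i + 32 <= half; i += 32) {
        __m128i a0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i a1 = _mm_loadu_si128((const __m128i *)(src + i + 16));
        __m128i b0 = _mm_loadu_si128((const __m128i *)(src + half + i));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(src + half + i + 16));
        _mm_storeu_si128((__m128i *)(dst + 2*i),      _mm_unpacklo_epi8(a0, b0));
        _mm_storeu_si128((__m128i *)(dst + 2*i + 16), _mm_unpackhi_epi8(a0, b0));
        _mm_storeu_si128((__m128i *)(dst + 2*i + 32), _mm_unpacklo_epi8(a1, b1));
        _mm_storeu_si128((__m128i *)(dst + 2*i + 48), _mm_unpackhi_epi8(a1, b1));
    }
    for (; i < half; i += 16) { /* odd 16-byte chunk left over */
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + half + i));
        _mm_storeu_si128((__m128i *)(dst + 2*i),      _mm_unpacklo_epi8(a, b));
        _mm_storeu_si128((__m128i *)(dst + 2*i + 16), _mm_unpackhi_epi8(a, b));
    }
}
```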


Martin


___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] libavcodec/exr : add SSE SIMD for reorder_pixels v2 (WIP)

2017-08-29 Thread Ivan Kalvachev
On 8/26/17, Martin Vignali  wrote:
> Hello,
>
> in attach a new patch for the SSE SIMD of reorder_pixels in the exr
> decoder (used by zip and rle uncompress), after comments on the previous
> patch by Ivan Kalvachev.
>
> After testing on only a small buffer, I fixed the overread problem of the
> previous patch (which hid the trouble in the last loop).
>
> It passes fate tests for me (on Mac OS X).
>
>
> Tested with the decoding of a sequence of 150 HD Exr images (CGI render
> with 17 layers per file in float pixel, ZIP16 compression)
>
> SSE :
> 349190 decicycles in reorder_pixels_zip, 130716 runs, 356 skips
> bench: utime=109.222s
> bench: maxrss=607002624kB
>
> Scalar :
> 3039686 decicycles in reorder_pixels_zip, 130395 runs, 677 skips
> bench: utime=123.042s
> bench: maxrss=607019008kB

That's impressive. :)

> Comments Welcome

Since you asked ;)
---
> [...]
> +;**
> +
> +%include "libavutil/x86/x86util.asm"

Still missing explicit x86inc.asm

> +SECTION .text
> +
> +;--
> +; void ff_reorder_pixels_sse2(uint8_t *src, uint8_t *dst, int size)
> +;--
> +
> +
> +INIT_XMM sse2
> +cglobal reorder_pixels, 3,6,3, src, dst, size

You do use r5, but not r4. Change r5 to r4 and decrease the usage above.

> +
> +shr    sizeq, 1;   sizeq = half_size
> +mov   r3, sizeq
> +shr   r3, 4;   r3 = half_size/16 -> loop_simd count
> +
> +loop_simd:
> +;initial condition loop
> +jle    after_loop_simd;    jump to scalar part if loop_simd count (r3) is 0
> +
> +movdqa m0, [srcq];   load first part
> +movdqu m1, [srcq + sizeq];   load second part

Would you test if moving the movdqu first makes any difference in speed?
I had a similar case and I think that makes it faster,
since movdqu has bigger latency.
Might not matter on newer cpu.

(If you can't tell the difference, leave it as it is.)

> +movdqa m2, m0;   copy m0
> +
> +punpcklbw m2, m1;   interleaved part 1

Both instructions could be combined as
 punpcklbw m2, m0, m1
And the movdqa would be inserted by the x86inc macro.

> +movdqa [dstq], m2;   copy to dst array
> +
> +punpckhbw m0, m1;   interleaved part 2
> +movdqa [dstq+mmsize], m0;   copy to dst array
> +add dstq, 2*mmsize; inc dst
> +add srcq, mmsize;   inc src
> +sub   r3, 1
> +jmp    loop_simd

I think there is a way to remove some of this arithmetic.
x86 allows a bit more complicated addressing of the form
[const+r1+r2*x], where x can be a power of two: 1, 2, 4 or 8.

If you use r3 as address offset (aka r3+=mmsize)
you can do:
mov  [dstq+r3*2+mmsize], m3

This however would mean that you'd need
src2 as separate pointer/register, because
[src1q+sizeq+r3] uses 3 different registers.

You'll also need two instructions, "add" and "cmp",
at the end of the loop.

Well, there are (at least) 2 ways to combine the add+cmp.

1. If you start with r3=aligned(sizeq)-mmsize,
and you make a descending loop with a single "sub r3, mmsize",
until you go negative.

2. A trick used in x264.
You start with r3=-1*aligned(sizeq) and you use "add r3,mmsize"
until you reach zero.
Since r3 is used as offset, you'd have to correct all the pointers too
(aka, add aligned sizeq to src1, src2 and double to dst).

In my tests, on my machine #2 was always slower
(and that's without second scalar loop,
that has to deal with "corrected" pointers).

I also found that accessing memory in descending order
is no worse than ascending, so I've been using
descending loops unless I needed increasing
positive index for some other reason.

I do not request that you use any of these methods,
but you might want to test and see
if any of them is faster for you.
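For illustration, trick #2 (negative index with corrected pointers) looks like this as a scalar C model (hypothetical helper for this thread, not code from the patch):

```c
#include <stddef.h>
#include <stdint.h>

/* All three pointers are advanced past their buffers, and one negative
 * index runs up toward zero; in asm, the single "add r3, mmsize" then
 * sets the flags that double as the loop-exit test. */
static void interleave_neg_index(const uint8_t *src, uint8_t *dst,
                                 ptrdiff_t half)
{
    const uint8_t *s1 = src + half;     /* end of first half     */
    const uint8_t *s2 = src + 2 * half; /* end of second half    */
    uint8_t *d = dst + 2 * half;        /* end of the dst buffer */
    for (ptrdiff_t i = -half; i < 0; i++) {
        d[2 * i]     = s1[i];           /* byte from first half  */
        d[2 * i + 1] = s2[i];           /* byte from second half */
    }
}
```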

> +after_loop_simd:
> +;scalar part
> +mov    r3, sizeq;   r3 = half_size
> +and   r3, 15;   r3 = half_size % 16

mmsize-1, instead of 15.

> +loop_scalar:
> +;initial condition loop
> +cmp   r3, 0
> +jle    end
> +
> +mov  r5b, [srcq];   load byte first part
> +mov   [dstq], r5b;  copy first byte to dst
> +
> +mov  r5b, [srcq + sizeq];   load byte second part
> +mov [dstq+1], r5b;  copy byte second part

In theory doing one write instead of two should be faster.

You can do some evil code here, using high and low byte register access
available from the 16 bit days of x86 (ax={ah,al}). Doing so is not better,
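The single-write idea can be sketched in C by packing the byte pair into one 16-bit store (hypothetical helper; the byte order relies on a little-endian target, and memcpy keeps the unaligned store well-defined):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write dst[2i] = src[i] and dst[2i+1] = src[half+i] with one 16-bit
 * store instead of two byte stores. Little-endian only: the low byte of
 * the pair lands at the lower address. */
static void pair_store(const uint8_t *src, uint8_t *dst,
                       ptrdiff_t half, ptrdiff_t i)
{
    uint16_t pair = (uint16_t)src[i] | (uint16_t)(src[half + i] << 8);
    memcpy(dst + 2 * i, &pair, sizeof(pair));
}
```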

[FFmpeg-devel] libavcodec/exr : add SSE SIMD for reorder_pixels v2 (WIP)

2017-08-26 Thread Martin Vignali
Hello,

in attach a new patch for the SSE SIMD of reorder_pixels in the exr
decoder (used by zip and rle uncompress), after comments on the previous
patch by Ivan Kalvachev.

After testing on only a small buffer, I fixed the overread problem of the
previous patch (which hid the trouble in the last loop).

It passes fate tests for me (on Mac OS X).


Tested with the decoding of a sequence of 150 HD Exr images (CGI render
with 17 layers per file in float pixel, ZIP16 compression)

SSE :
349190 decicycles in reorder_pixels_zip, 130716 runs, 356 skips
bench: utime=109.222s
bench: maxrss=607002624kB

Scalar :
3039686 decicycles in reorder_pixels_zip, 130395 runs, 677 skips
bench: utime=123.042s
bench: maxrss=607019008kB


Comments Welcome

Martin
Jokyo Images


0001-libavcodec-add-SSE-SIMD-for-reorder-pixels.patch
Description: Binary data