2017-12-03 20:36 GMT+01:00 Paul B Mahol <one...@gmail.com>:

> On 12/3/17, Martin Vignali <martin.vign...@gmail.com> wrote:
> >>
> >> In any case, if clang or gcc can generate better code, then the hand
> >> written version needs to be optimized to be as fast or faster.
> >>
> >>
> >>
> > Quick test: passes checkasm (but probably only because width = 256)
> > hflip_byte_c: 26.4
> > hflip_byte_ssse3: 20.4
> >
> >
> > INIT_XMM ssse3
> > cglobal hflip_byte, 3, 5, 2, src, dst, w, x, v, src2
> >     mova    m0, [pb_flip_byte]
> >     xor     xq, xq ; <======
> >     mov     wd, dword wm
> >     sub     wq, mmsize * 2
> > ;remove the cmp here <======
> >     jl .skip
> >
> >     .loop0: ; process two xmm in the loop
> >         neg     xq
> >         movu    m1, [srcq + xq - mmsize + 1]
> >         movu    m2, [srcq + xq - mmsize * 2 + 1] ; <======
> >         pshufb  m1, m0
> >         pshufb  m2, m0 ; <======
> >         neg     xq
> >         movu    [dstq + xq], m1
> >         movu    [dstq + xq + mmsize], m2 ; <======
> >         add     xq, mmsize * 2 ; <======
> >         cmp     xq, wq
> >         jl .loop0
> >      RET ; add RET here
> >
> > ; MISSING: process one more xmm here if needed
> >
> > .skip:
> >     add     wq, mmsize
> >     .loop1:
> >         neg    xq
> >         mov    vb, [srcq + xq]
> >         neg    xq
> >         mov    [dstq + xq], vb
> >         add    xq, 1
> >         cmp    xq, wq
> >         jl .loop1
> > RET
>
> So what is wrong now?
>

I hadn't seen your email when I sent mine.

checkasm results with your last patch (after changing "add     xq, mmsize"
to "add     xq, mmsize * 2" in the short version):
hflip_byte_c: 28.0
hflip_byte_ssse3: 127.5
hflip_short_c: 276.5
hflip_short_ssse3: 100.2


Do you think that adding RET after the end of loop0 works in all cases?
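
One way to check this is a scalar C model of the control flow quoted above.
This is only a sketch, not FFmpeg code: hflip_byte_model and MMSIZE are
made-up names, and it assumes that src points at the last byte of the row
(as the negative offsets suggest) and that the loop bounds are read
correctly from the asm. It returns how many output bytes were written, so
the result can be compared with w.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MMSIZE 16 /* bytes per xmm register (assumed name) */

/* Hypothetical model of the quoted hflip_byte sketch.
 * src_last: pointer to the LAST byte of the input row (assumption).
 * Returns the number of bytes written to dst. */
static ptrdiff_t hflip_byte_model(const uint8_t *src_last, uint8_t *dst, int w)
{
    ptrdiff_t x  = 0;                          /* xor xq, xq */
    ptrdiff_t wq = (ptrdiff_t)w - MMSIZE * 2;  /* sub wq, mmsize * 2 */

    assert(w > 0);

    if (wq >= 0) {                             /* jl .skip not taken */
        do {                                   /* .loop0: two xmm per pass */
            for (int i = 0; i < MMSIZE * 2; i++)
                dst[x + i] = src_last[-(x + i)];
            x += MMSIZE * 2;                   /* add xq, mmsize * 2 */
        } while (x < wq);                      /* cmp xq, wq / jl .loop0 */
        return x;                              /* the RET added after .loop0 */
    }

    wq += MMSIZE;                              /* .skip: add wq, mmsize */
    do {                                       /* .loop1: one byte per pass */
        dst[x] = src_last[-x];
        x += 1;
    } while (x < wq);
    return x;
}
```

Calling hflip_byte_model(src + w - 1, dst, w) for a few widths and comparing
the return value with w shows which widths are fully covered. Whenever the
result is smaller than w, the end of dst is left unwritten, so under this
reading RET right after loop0 still needs the missing xmm step or the scalar
loop for the remaining bytes.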