2017-12-03 20:36 GMT+01:00 Paul B Mahol <one...@gmail.com>:

> On 12/3/17, Martin Vignali <martin.vign...@gmail.com> wrote:
> >>
> >> In any case, if clang or gcc can generate better code, then the hand
> >> written version needs to be optimized to be as fast or faster.
> >>
> > Quick test : pass checkasm (but probably only because width = 256)
> > hflip_byte_c: 26.4
> > hflip_byte_ssse3: 20.4
> >
> > INIT_XMM ssse3
> > cglobal hflip_byte, 3, 5, 2, src, dst, w, x, v, src2
> >     mova    m0, [pb_flip_byte]
> >     xor     xq, xq                           ; <======
> >     mov     wd, dword wm
> >     sub     wq, mmsize * 2
> >     ;remove the cmp here                       <======
> >     jl .skip
> >
> > .loop0: ; process two xmm in the loop
> >     neg     xq
> >     movu    m1, [srcq + xq - mmsize + 1]
> >     movu    m2, [srcq + xq - mmsize * 2 + 1]   <======
> >     pshufb  m1, m0
> >     pshufb  m2, m0                             <======
> >     neg     xq
> >     movu    [dstq + xq], m1
> >     movu    [dstq + xq + mmsize], m2           <======
> >     add     xq, mmsize * 2                     <======
> >     cmp     xq, wq
> >     jl .loop0
> >     RET ; add RET here
> >
> > ; MISSING one xmm process if need
> >
> > .skip:
> >     add     wq, mmsize
> > .loop1:
> >     neg     xq
> >     mov     vb, [srcq + xq]
> >     neg     xq
> >     mov     [dstq + xq], vb
> >     add     xq, 1
> >     cmp     xq, wq
> >     jl .loop1
> >     RET
>
> So what is wrong now?
>
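To make the tail handling easier to talk about, here is a rough C model of what the two loops have to compute together (just a sketch, not the actual FFmpeg reference; it assumes src points at the last pixel of the source row, so the flip is simply dst[x] = src[-x], and the function name is mine):

#include <stdint.h>

static void hflip_byte_model(const uint8_t *src, uint8_t *dst, int w)
{
    int x = 0;

    /* wide pass, like .loop0: 2 * mmsize = 32 bytes per iteration */
    for (; x <= w - 32; x += 32)
        for (int i = 0; i < 32; i++)
            dst[x + i] = src[-(x + i)];

    /* scalar tail, the .loop1 idea: handles w < 32 and any bytes
     * left over after the wide pass */
    for (; x < w; x++)
        dst[x] = src[-x];
}

With width = 256 (a multiple of 32) the wide loop covers the whole row, so a checkasm run at that width never exercises the tail.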
I didn't see your email when I sent mine.

Checkasm results with your last patch (after changing "add xq, mmsize" to "add xq, mmsize * 2" for the short version):

hflip_byte_c: 28.0
hflip_byte_ssse3: 127.5
hflip_short_c: 276.5
hflip_short_ssse3: 100.2

Do you think adding a RET after the end of .loop0 can make it work in all cases?

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel