Hi,

On Fri, Oct 2, 2015 at 5:31 PM, Henrik Gramner <hen...@gramner.com> wrote:
> On Fri, Sep 25, 2015 at 11:24 PM, Ronald S. Bultje <rsbul...@gmail.com>
> wrote:
> > +++ b/libavcodec/x86/vp9intrapred_16bpp.asm
>
> > +cglobal vp9_ipred_v_4x4_16, 2, 4, 1, dst, stride, l, a
> > +cglobal vp9_ipred_v_8x8_16, 2, 4, 1, dst, stride, l, a
> > +cglobal vp9_ipred_v_16x16_16, 2, 4, 2, dst, stride, l, a
> > +cglobal vp9_ipred_v_32x32_16, 2, 4, 4, dst, stride, l, a
>
> Those look pretty generic. Isn't some H.264 pred very similar if not
> identical? I didn't check, but if they are you can just use those
> instead.

Well, the prototype is different. For H/V, it's not critical, but for
the directional ones, the edge handling is very quirky, so I wanted to
do that in C. That is why l/a are arguments instead of part of the
source buffer. (And because we do in-loop filtering, doing V as-is from
h264 won't work, since a can be post-loopfilter, whereas in h264 it's
required to be pre-, and we don't swap in vp9.)

> > +cglobal vp9_ipred_h_8x8_16, 3, 4, 5, dst, stride, l, a
>
> Seemed a bit inefficient so i rewrote it. Around 2x as fast and fewer regs:
>
> cglobal vp9_ipred_h_8x8_16, 3, 3, 4, dst, stride, l, a
>     mova                    m2, [lq]
>     DEFINE_ARGS dst, stride, stride3
>     lea               stride3q, [strideq*3]
>     punpckhwd               m3, m2, m2
>     pshufd                  m0, m3, q3333
>     pshufd                  m1, m3, q2222
>     mova      [dstq+strideq*0], m0
>     mova      [dstq+strideq*1], m1
>     pshufd                  m0, m3, q1111
>     pshufd                  m1, m3, q0000
>     mova      [dstq+strideq*2], m0
>     mova      [dstq+stride3q ], m1
>     lea                   dstq, [dstq+strideq*4]
>     punpcklwd               m2, m2
>     pshufd                  m0, m2, q3333
>     pshufd                  m1, m2, q2222
>     mova      [dstq+strideq*0], m0
>     mova      [dstq+strideq*1], m1
>     pshufd                  m0, m2, q1111
>     pshufd                  m1, m2, q0000
>     mova      [dstq+strideq*2], m0
>     mova      [dstq+stride3q ], m1
>     RET
>
> > +cglobal vp9_ipred_h_16x16_16, 3, 4, 6, dst, stride, l, a
> > +cglobal vp9_ipred_h_32x32_16, 3, 5, 8, dst, stride, l, a
>
> Should be possible to change those to be more similar to the 8x8 above.
>
> > +cglobal vp9_ipred_dc_4x4_16, 4, 4, 2, dst, stride, l, a
> [...]
> > +    pshufw                  m1, m0, q3232
> > +    paddd                   m0, m1
> > +    paddd                   m0, [pd_4]
>
> Swap the last two rows to allow the shuffle and the pd_4 add to
> execute in parallel. The same issue exists in pretty much every other
> dc function as well.
>
> > +cglobal vp9_ipred_dc_32x32_16, 4, 4, 2, dst, stride, l, a
> [...]
> > +.loop:
> > +    mova   [dstq+strideq*0+ 0], m0
> > +    mova   [dstq+strideq*0+16], m0
> > +    mova   [dstq+strideq*0+32], m0
> > +    mova   [dstq+strideq*0+48], m0
> > +    mova   [dstq+strideq*1+ 0], m0
> > +    mova   [dstq+strideq*1+16], m0
> > +    mova   [dstq+strideq*1+32], m0
> > +    mova   [dstq+strideq*1+48], m0
> > +    mova   [dstq+strideq*2+ 0], m0
> > +    mova   [dstq+strideq*2+16], m0
> > +    mova   [dstq+strideq*2+32], m0
> > +    mova   [dstq+strideq*2+48], m0
> > +    mova   [dstq+stride3q + 0], m0
> > +    mova   [dstq+stride3q +16], m0
> > +    mova   [dstq+stride3q +32], m0
> > +    mova   [dstq+stride3q +48], m0
> > +    lea                   dstq, [dstq+strideq*4]
> > +    dec                   cntd
> > +    jg .loop
>
> Cut the number of stores per iteration in half and double the number
> of iterations instead.
>
> > +cglobal vp9_ipred_dc_%1_32x32_16, 4, 4, 2, dst, stride, l, a
> [...]
> > +.loop:
> > +    mova   [dstq+strideq*0+ 0], m0
> > +    mova   [dstq+strideq*0+16], m0
> > +    mova   [dstq+strideq*0+32], m0
> > +    mova   [dstq+strideq*0+48], m0
> > +    mova   [dstq+strideq*1+ 0], m0
> > +    mova   [dstq+strideq*1+16], m0
> > +    mova   [dstq+strideq*1+32], m0
> > +    mova   [dstq+strideq*1+48], m0
> > +    mova   [dstq+strideq*2+ 0], m0
> > +    mova   [dstq+strideq*2+16], m0
> > +    mova   [dstq+strideq*2+32], m0
> > +    mova   [dstq+strideq*2+48], m0
> > +    mova   [dstq+stride3q + 0], m0
> > +    mova   [dstq+stride3q +16], m0
> > +    mova   [dstq+stride3q +32], m0
> > +    mova   [dstq+stride3q +48], m0
> > +    lea                   dstq, [dstq+strideq*4]
> > +    dec                   cntd
> > +    jg .loop
>
> Ditto.
>
> > +cglobal vp9_ipred_tm_4x4_10, 4, 4, 6, dst, stride, l, a
> [...]
> > +    movd                    m0, [aq-2]
> > +    pshufw                  m0, m0, q0000
>
> Unaligned load penalty, either movd from [aq-4] or pshufw directly from
> [aq-8].
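
The three load variants mentioned in this comment can be compared with a small scalar C model. This is only a sketch and not code from the patch: the helper names are hypothetical, it assumes that `a` is 16-byte aligned with the top-left pixel stored at `a[-1]` (the thread does not state either), and the word index noted for each variant is an inference about which shuffle selector the changed load would need.

```c
#include <stdint.h>
#include <string.h>

/* Model of `movd m0, [aq-2]`: a 4-byte load starting at a[-1].
 * The top-left pixel lands in word 0, which is why q0000 broadcasts it. */
static uint16_t top_left_via_minus2(const uint16_t *a)
{
    uint16_t w[2];
    memcpy(w, (const uint8_t *)a - 2, sizeof(w));
    return w[0];
}

/* Model of `movd m0, [aq-4]`: the load now covers a[-2], a[-1], so the
 * top-left pixel sits in word 1 (inferred selector: q1111). */
static uint16_t top_left_via_minus4(const uint16_t *a)
{
    uint16_t w[2];
    memcpy(w, (const uint8_t *)a - 4, sizeof(w));
    return w[1];
}

/* Model of `pshufw m0, [aq-8], ...`: an 8-byte load covering a[-4..-1],
 * so the top-left pixel sits in word 3 (inferred selector: q3333). */
static uint16_t top_left_via_minus8(const uint16_t *a)
{
    uint16_t w[4];
    memcpy(w, (const uint8_t *)a - 8, sizeof(w));
    return w[3];
}

/* Returns 1 if an access of `size` bytes at byte offset `off` from `a`
 * is naturally aligned, 0 otherwise. */
static int load_is_aligned(const uint16_t *a, int off, unsigned size)
{
    return ((uintptr_t)((const uint8_t *)a + off) % size) == 0;
}
```

Under the stated alignment assumption, all three helpers return the same pixel, while `load_is_aligned(a, -2, 4)` is 0 and both `load_is_aligned(a, -4, 4)` and `load_is_aligned(a, -8, 8)` are 1.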
>
> > +cglobal vp9_ipred_tm_8x8_10, 4, 4, 8, dst, stride, l, a
> [...]
> > +    movd                    m0, [aq-2]
> > +    pshuflw                 m0, m0, q0000
>
> Ditto, except you don't want to pshuflw directly from memory in this
> case unlike with MMX. You can use vpbroadcastw instead though if you
> want to write AVX2. This issue exists in multiple other places as
> well.
>
> > +    pshufhw                 m0, m4, q3333
> > +    pshufhw                 m1, m4, q2222
> > +    pshufhw                 m2, m4, q1111
> > +    pshufhw                 m3, m4, q0000
> > +    punpckhqdq              m0, m0
> > +    punpckhqdq              m1, m1
> > +    punpckhqdq              m2, m2
> > +    punpckhqdq              m3, m3
>
> Use punpckhwd + pshufd instead, same as in vp9_ipred_h_8x8_16 above.

All done.

Ronald
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
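
For readers following the punpckhwd + pshufd suggestion, the broadcast pattern used in the quoted vp9_ipred_h_8x8_16 rewrite can be modelled in scalar C. This is a rough sketch rather than FFmpeg code: the function names are hypothetical, `stride` is counted in pixels here (the asm's `strideq` is in bytes), and the row order is simply read off the store sequence in the quoted asm.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of punpckhwd/punpcklwd with both sources equal: each of
 * the four 16-bit words in the high (or low) half of src is duplicated. */
static void unpack_dup(const uint16_t src[8], int high, uint16_t out[8])
{
    const uint16_t *half = src + (high ? 4 : 0);
    for (int i = 0; i < 4; i++)
        out[2 * i] = out[2 * i + 1] = half[i];
}

/* Scalar model of pshufd with selector q0000..q3333: 32-bit lane n of
 * src is copied into all four 32-bit lanes of out. */
static void broadcast_dword(const uint16_t src[8], int n, uint16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = src[2 * n];
        out[2 * i + 1] = src[2 * n + 1];
    }
}

/* Horizontal 8x8 prediction in the same order as the quoted asm:
 * rows 0-3 come from the high half of l (q3333 first), rows 4-7 from
 * the low half. `stride` is in pixels (assumption for this model). */
static void ipred_h_8x8_model(uint16_t *dst, ptrdiff_t stride,
                              const uint16_t l[8])
{
    uint16_t dup[8], row[8];
    for (int y = 0; y < 8; y++) {
        unpack_dup(l, y < 4, dup);
        broadcast_dword(dup, 3 - (y & 3), row);
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = row[x];
    }
}
```

In this model row y ends up filled with `l[7 - y]`: one unpack produces four pairs of duplicated pixels, and each pshufd then expands one pair into a full row, which is the pattern the review proposes in place of pshufhw followed by punpckhqdq.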