Hi, chen ----- 原始邮件 ----- > 发件人: "chen" <chenm...@163.com> > 收件人: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org> > 发送时间: 星期二, 2019年 12 月 03日 下午 4:59:06 > 主题: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for > filter_column()
> comments inline in code > > > At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: >>From: Xu Jun <xuju...@sjtu.edu.cn> >> >>+; void filter_column(uint8_t *dst, int height, >>+; float rdiv, float bias, const int *const matrix, >>+; const uint8_t *c[], int length, int radius, >>+; int dstride, int stride); >>+ >>+%if ARCH_X86_64 >>+INIT_XMM sse4 >>+%if UNIX64 >>+cglobal filter_column, 8, 15, 7, dst, height, matrix, ptr, width, rad, >>dstride, >>stride, i, ci, dst_off, off16, c_off, sum, r >>+%else >>+cglobal filter_column, 8, 15, 7, dst, height, rdiv, bias, matrix, ptr, width, >>rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r > >>+%endif > no idea, these are difficult to read and understand I will rename some variables to make it more readable. Do I need to add some notes here? > > > > >>+ >>+%if WIN64 >>+ SWAP m0, m2 >>+ SWAP m1, m3 >>+ mov r2q, matrixmp >>+ mov r3q, ptrmp >>+ mov r4q, widthmp >>+ mov r5q, radmp >>+ mov r6q, dstridemp >>+ mov r7q, stridemp >>+ DEFINE_ARGS dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, >>dst_off, off16, c_off, sum, r >>+%endif >>+ >>+movsxdifnidn widthq, widthd >>+movsxdifnidn radq, radd >>+movsxdifnidn dstrideq, dstrided >>+movsxdifnidn strideq, strided >>+sal radq, 1 > >>+add radq, 1 ;2*radius+1 > I don't know how about compare to "LEA x,[y*2+1]" > And....I want not discuss in between SAL and SHL > I think lea is better and I will change in the next version. > >>+movsxdifnidn heightq, heightd >>+VBROADCASTSS m0, m0 >>+VBROADCASTSS m1, m1 >>+pxor m6, m6 >>+movss m5, [half] >>+VBROADCASTSS m5, m5 >>+ >>+xor dst_offq, dst_offq >>+xor c_offq, c_offq >>+ >>+.loopy: >>+ xor off16q, off16q >>+ cmp widthq, mmsize/4 >>+ jl .loopr >>+ >>+ mov rq, widthq >>+ and rq, mmsize/4-1 >>+ sub widthq, rq >>+ > >>+ .loop16: ;parallel process 16 elements in a row > Processing 4 column per loop, are you means, we want to save lots of unused > register? > We claim X64, so we have 16 of XMMs Will use more XMMs and process 16 column at a time. > > >>+ pxor m4, m4 >>+ xor iq, iq >>+ .loopi: > >>+ movss m2, [matrixq + 4*iq] > no idea that you working on Float data path, we are lucky, Intel CPU sounds > not > penalty in here. Will change to Interger data path using movd. And movd seems to have less CPI than movss. > > >>+ VBROADCASTSS m2, m2 >>+ mov ciq, [ptrq + iq * gprsize] >>+ movss m3, [ciq + c_offq] ;c[i][y*stride + off16] >>+ punpcklbw m3, m6 > >>+ punpcklwd m3, m6 > Since you claim SSE4, the instruction PMOVZXBD available, moreover, SSE4 > register can be full fill 16 of uint8, but load 4 of them only. I thought that since I would multiply 4 ints, loading 4 uint8s per loop is OK. Now I know that read 16 uint8s and shuffle them is faster. Will change in next version. > >>+ pmulld m2, m3 >>+ paddd m4, m2 >>+ >>+ add iq, 1 > >>+ cmp iq, radq > When you initial iq to radq and decrement per loop, you can reduce one > instruction > I know iq is work as index in the loop, but we can found some trick over > there. Will change in next V. >>+ jl .loopi >>+ >>+ cvtdq2ps m4, m4 >>+ mulps m4, m0 ; sum *= rdiv >>+ addps m4, m1 ; sum += bias > >>+ addps m4, m5 ; sum += 0.5 > I don't know how about precision mismatch if we pre-compute (bias+0.5) Here may not be modified after discussions. > > >>+ cvttps2dq m4, m4 >>+ packssdw m4, m4 >>+ packuswb m4, m4 >>+ movss [dstq + dst_offq], m4 >>+ add c_offq, mmsize/4 >>+ add dst_offq, mmsize/4 >>+ >>+ add off16q, mmsize/4 >>+ cmp off16q, widthq >>+ jl .loop16 >>+ >>+ add widthq, rq >>+ cmp off16q, widthq >>+ jge .paraend >>+ > >>+ .loopr: > no idea about this loop, if we can read beyond, we can reuse above SIMD code Here may not be modified too. Xu Jun > > >>+ xor sumd, sumd >>+ xor iq, iq >>+ .loopr_i: >>+ mov ciq, [ptrq + iq * gprsize] >>+ movzx rd, byte [ciq + c_offq] >>+ imul rd, [matrixq + 4*iq] >>+ add sumd, rd >>+ >>+ add iq, 1 >>+ cmp iq, radq >>+ jl .loopr_i >>+ >>+ pxor m4, m4 >>+ cvtsi2ss m4, sumd >>+ mulss m4, m0 ; sum *= rdiv >>+ addss m4, m1 ; sum += bias >>+ addss m4, m5 ; sum += 0.5 >>+ cvttps2dq m4, m4 >>+ packssdw m4, m4 >>+ packuswb m4, m4 >>+ movd sumd, m4 >>+ mov [dstq + dst_offq], sumb >>+ add c_offq, 1 >>+ add dst_offq, 1 >>+ add off16q, 1 >>+ cmp off16q, widthq >>+ jl .loopr >>+ >>+ .paraend: >>+ sub c_offq, widthq >>+ sub dst_offq, widthq >>+ add c_offq, strideq >>+ add dst_offq, dstrideq >>+ >>+ sub heightq, 1 >>+ cmp heightq, 0 >>+ jg .loopy >>+ >>+.end: >>+ RET > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe". -- 敬颂钧安, 徐鋆 电子信息与电气工程学院 上海交通大学 邮箱:xuju...@sjtu.edu.cn 地址:上海市闵行区东川路800号 Yours sincerely, Xu Jun School of Electronic, Information and Electrical Engineering Shanghai Jiao Tong University Email: xuju...@sjtu.edu.cn No. 800, Dongchuan Road, Minhang District, Shanghai 200240, China 宜しくお愿いたします 徐鋆 電子情報と電気工程学院 上海交通大学 メールアドレス :xuju...@sjtu.edu.cn 住所:上海市閔行区ドンチュワンルー800号 _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".