On 27 July 2018 at 12:47, James Darnley <jdarn...@obe.tv> wrote: > On 2018-07-26 17:29, Rostislav Pehlivanov wrote: > > On 26 July 2018 at 12:28, James Darnley <jdarn...@obe.tv> wrote: > > +cglobal vertical_compose_haar_10bit, 3, 6, 4, b0, b1, w > >> + DECLARE_REG_TMP 4,5 > >> + > >> + mova m2, [pd_1] > >> + mov r3d, wd > >> + and wd, ~(mmsize/4 - 1) > >> + shl wd, 2 > >> + add b0q, wq > >> + add b1q, wq > >> + neg wq > >> + > >> + ALIGN 16 > >> + .loop_simd: > >> + mova m0, [b0q + wq] > >> + mova m1, [b1q + wq] > >> + paddd m3, m1, m2 > >> + psrad m3, 1 > >> + psubd m0, m3 > >> + paddd m1, m0 > >> + mova [b0q + wq], m0 > >> + mova [b1q + wq], m1 > >> + add wq, mmsize > >> + jl .loop_simd > >> + > >> + and r3d, mmsize/4 - 1 > >> + jz .end > >> + .loop_scalar: > >> + mov t0d, [b0q] > >> + mov t1d, [b1q] > >> + mov r2d, t1d > >> + add r2d, 1 > >> + sar r2d, 1 > >> + sub t0d, r2d > >> + add t1d, t0d > >> + mov [b0q], t0d > >> + mov [b1q], t1d > >> + > >> + add b0q, 4 > >> + add b1q, 4 > >> + sub r3d, 1 > >> + jg .loop_scalar > >> + > >> + .end: > >> +RET > >> + > >> +%endmacro > >> + > >> +%macro HAAR_HORIZONTAL 0 > >> > > + > >> > > > > Could you remove this newline from every patch? All asm I've written and > > seen keep them without a newline. It made me think there's something in > the > > asm which checked the value of the macro, not that the entire function is > > macro'd. > > What? I don't understand what you mean. Do you think I have too many > blank lines between things? >
Just remove the newline between the macro definition and the cglobal. > +cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, > >> x, b2 > >> + DECLARE_REG_TMP 2,5 > >> + %if ARCH_X86_64 > >> + %define tail r6d > >> + %else > >> + %define tail dword wm > >> + %endif > >> + > >> + mova m2, [pd_1] > >> + xor xd, xd > >> + shr wd, 1 > >> + mov tail, wd > >> + lea b2q, [bq + 4*wq] > >> + > >> + ALIGN 16 > >> + .loop_lo: > >> + mova m0, [bq + 4*xq] > >> + movu m1, [b2q + 4*xq] > >> + paddd m1, m2 > >> + psrad m1, 1 > >> + psubd m0, m1 > >> + mova [temp_q + 4*xq], m0 > >> + add xd, mmsize/4 > >> + cmp xd, wd > >> + jl .loop_lo > >> + > >> + xor xd, xd > >> + and wd, ~(mmsize/4 - 1) > >> + > >> + ALIGN 16 > >> + .loop_hi: > >> + mova m0, [temp_q + 4*xq] > >> + movu m1, [b2q + 4*xq] > >> + paddd m1, m0 > >> + paddd m0, m2 > >> + paddd m1, m2 > >> + psrad m0, 1 > >> + psrad m1, 1 > >> + SBUTTERFLY dq, 0,1,3 > >> + %if cpuflag(avx2) > >> + SBUTTERFLY dqqq, 0,1,3 > >> + %endif > >> + mova [bq + 8*xq], m0 > >> + mova [bq + 8*xq + mmsize], m1 > >> + add xd, mmsize/4 > >> + cmp xd, wd > >> + jl .loop_hi > >> + > >> + and tail, mmsize/4 - 1 > >> + jz .end > >> + .loop_scalar: > >> + mov t0d, [temp_q + 4*xq] > >> + mov t1d, [b2q + 4*xq] > >> + add t1d, t0d > >> + add t0d, 1 > >> + add t1d, 1 > >> + sar t0d, 1 > >> + sar t1d, 1 > >> + mov [bq + 8*xq], t0d > >> + mov [bq + 8*xq + 4], t1d > >> + add xq, 1 > >> + sub tail, 1 > >> + jg .loop_scalar > >> + > >> + .end: > >> +REP_RET > >> + > >> +%endmacro > >> + > >> +INIT_XMM sse2 > >> +HAAR_HORIZONTAL > >> +HAAR_VERTICAL > >> + > >> +INIT_XMM avx > >> +HAAR_HORIZONTAL > >> +HAAR_VERTICAL > >> > > > > You're not using any avx functions in that version, not unless a macro'd > > instruction inserts one for you. I think you should remove the avx > version > > then. > > Also since you always have a HAAR_HORIZONTAL and HAAR_VERTICAL macros per > > version you can just make a single macro to do both versions at the same > > time. > > Now that I think about it there will be only one 3-operand instruction > in the SBUTTERFLY and the vertical function also only has 1. I will > remove it. > > I can merge the two macros but I will look back at what I've done > previously. I think it is usually 1 macro per function. > > > + > >> +INIT_YMM avx2 > >> +HAAR_HORIZONTAL > >> +HAAR_VERTICAL > >> diff --git a/libavcodec/x86/dirac_dwt_init_10bit.c > >> b/libavcodec/x86/dirac_dwt_init_10bit.c > >> new file mode 100644 > >> index 0000000000..289862d728 > >> --- /dev/null > >> +++ b/libavcodec/x86/dirac_dwt_init_10bit.c > >> @@ -0,0 +1,76 @@ > >> +/* > >> + * x86 optimized discrete wavelet transform > >> + * Copyright (c) 2018 James Darnley > >> + * > >> + * This file is part of FFmpeg. > >> + * > >> + * FFmpeg is free software; you can redistribute it and/or > >> + * modify it under the terms of the GNU Lesser General Public > >> + * License as published by the Free Software Foundation; either > >> + * version 2.1 of the License, or (at your option) any later version. > >> + * > >> + * FFmpeg is distributed in the hope that it will be useful, > >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > >> + * Lesser General Public License for more details. > >> + * > >> + * You should have received a copy of the GNU Lesser General Public > >> + * License along with FFmpeg; if not, write to the Free Software > >> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > >> 02110-1301 USA > >> + */ > >> + > >> +#include "libavutil/x86/asm.h" > >> +#include "libavutil/x86/cpu.h" > >> +#include "libavcodec/dirac_dwt.h" > >> + > >> +void ff_horizontal_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, > int > >> width_align); > >> +void ff_horizontal_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, > int > >> width_align); > >> +void ff_horizontal_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, > int > >> width_align); > >> + > >> +void ff_vertical_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, int > >> width_align); > >> +void ff_vertical_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, int > >> width_align); > >> +void ff_vertical_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, int > >> width_align); > >> + > >> +av_cold void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum > dwt_type > >> type) > >> +{ > >> +#if HAVE_X86ASM > >> + int cpu_flags = av_get_cpu_flags(); > >> + > >> + if (EXTERNAL_SSE2(cpu_flags)) { > >> + switch (type) { > >> + case DWT_DIRAC_HAAR0: > >> + d->vertical_compose = (void*)ff_vertical_compose_ > >> haar_10bit_sse2; > >> + break; > >> + case DWT_DIRAC_HAAR1: > >> + d->horizontal_compose = (void*)ff_horizontal_compose_ > >> haar_10bit_sse2; > >> + d->vertical_compose = (void*)ff_vertical_compose_ > >> haar_10bit_sse2; > >> + break; > >> + } > >> + } > >> + > >> + if (EXTERNAL_AVX(cpu_flags)) { > >> + switch (type) { > >> + case DWT_DIRAC_HAAR0: > >> + d->vertical_compose = (void*)ff_vertical_compose_ > >> haar_10bit_avx; > >> + break; > >> + case DWT_DIRAC_HAAR1: > >> + d->horizontal_compose = (void*)ff_horizontal_compose_ > >> haar_10bit_avx; > >> + d->vertical_compose = (void*)ff_vertical_compose_ > >> haar_10bit_avx; > >> + break; > >> + } > >> > > > > We keep switches and cases on the same line. > > I had a look and it is the most common style even though not everywhere > holds to it. Also `indent` does it that way. So I will change it and > keep it in mind for the future. > > >> + > >> +%macro HAAR_HORIZONTAL 0 > >> + > >> +cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, > w, > >> x, b2 > >> + DECLARE_REG_TMP 2,5 > >> + %if ARCH_X86_64 > >> + %define tail r6d > >> + %else > >> + %define tail dword wm > >> + %endif > >> + > >> > > > > You can remove this whole bit, the init function only gets called if > > ARCH_X86_64 is true. > > Where did you get that from? I don't require 64-bit for this. > Oh, right, I misread > - if (ARCH_X86 && bit_depth == 8) > +#if ARCH_X86 > + if (bit_depth == 8) > ff_spatial_idwt_init_x86(d, type); > + else if (bit_depth == 10) > + ff_spatial_idwt_init_10bit_x86(d, type); > +#endif > + as ARCH_X86_64, nevermind. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel