On 6/22/2016 8:15 PM, Dan Parrot wrote: > On Thu, 2016-06-23 at 01:03 +0200, Michael Niedermayer wrote: >> On Tue, Jun 21, 2016 at 12:04:42AM -0500, Dan Parrot wrote: >>> On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote: >>>> On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote: >>>>> On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote: >>>>>> On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote: >>>>>>> On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote: >>>>>>>> On Sun, Jun 19, 2016 at 09:57:42PM +0000, Dan Parrot wrote: >>>>>>>>> First commit addressing Trac ticket #5570. Functions defined in >>>>>>>>> libswscale/input.c >>>>>>>>> have corresponding SIMD definitions in libswscale/ppc/input_vsx.c >>>>>>>>> --- >>>>>>>>> libswscale/ppc/Makefile | 1 + >>>>>>>>> libswscale/ppc/input_vsx.c | 1070 >>>>>>>>> +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>> libswscale/swscale.c | 3 + >>>>>>>>> libswscale/swscale_internal.h | 1 + >>>>>>>>> 4 files changed, 1075 insertions(+) >>>>>>>>> create mode 100644 libswscale/ppc/input_vsx.c >>>>>>>>> >>>>>>>>> diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile >>>>>>>>> index d1b596e..2482893 100644 >>>>>>>>> --- a/libswscale/ppc/Makefile >>>>>>>>> +++ b/libswscale/ppc/Makefile >>>>>>>>> @@ -1,3 +1,4 @@ >>>>>>>>> OBJS += ppc/swscale_altivec.o >>>>>>>>> \ >>>>>>>>> + ppc/input_vsx.o >>>>>>>>> \ >>>>>>>>> ppc/yuv2rgb_altivec.o >>>>>>>>> \ >>>>>>>>> ppc/yuv2yuv_altivec.o >>>>>>>>> \ >>>>>>>>> diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c >>>>>>>>> new file mode 100644 >>>>>>>>> index 0000000..adb0e38 >>>>>>>>> --- /dev/null >>>>>>>>> +++ b/libswscale/ppc/input_vsx.c >>>>>>>>> @@ -0,0 +1,1070 @@ >>>>>>>>> +/* >>>>>>>>> + * Copyright (C) 2016 Dan Parrot <dan.par...@mail.com> >>>>>>>>> + * >>>>>>>>> + * This file is part of FFmpeg. >>>>>>>>> + * >>>>>>>>> + * FFmpeg is free software; you can redistribute it and/or >>>>>>>>> + * modify it under the terms of the GNU Lesser General Public >>>>>>>>> + * License as published by the Free Software Foundation; either >>>>>>>>> + * version 2.1 of the License, or (at your option) any later version. >>>>>>>>> + * >>>>>>>>> + * FFmpeg is distributed in the hope that it will be useful, >>>>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >>>>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >>>>>>>>> + * Lesser General Public License for more details. >>>>>>>>> + * >>>>>>>>> + * You should have received a copy of the GNU Lesser General Public >>>>>>>>> + * License along with FFmpeg; if not, write to the Free Software >>>>>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA >>>>>>>>> 02110-1301 USA >>>>>>>>> + */ >>>>>>>>> + >>>>>>>>> +#include <math.h> >>>>>>>>> +#include <stdint.h> >>>>>>>>> +#include <stdio.h> >>>>>>>>> +#include <string.h> >>>>>>>>> + >>>>>>>>> +#include "libavutil/avutil.h" >>>>>>>>> +#include "libavutil/bswap.h" >>>>>>>>> +#include "libavutil/cpu.h" >>>>>>>>> +#include "libavutil/intreadwrite.h" >>>>>>>>> +#include "libavutil/mathematics.h" >>>>>>>>> +#include "libavutil/pixdesc.h" >>>>>>>>> +#include "libavutil/avassert.h" >>>>>>>>> +#include "config.h" >>>>>>>>> +#include "libswscale/rgb2rgb.h" >>>>>>>>> +#include "libswscale/swscale.h" >>>>>>>>> +#include "libswscale/swscale_internal.h" >>>>>>>>> + >>>>>>>>> +#define input_pixel(pos) (isBE(origin) ? AV_RB16(pos) : AV_RL16(pos)) >>>>>>>>> + >>>>>>>>> +#define r ((origin == AV_PIX_FMT_BGR48BE || origin == >>>>>>>>> AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == >>>>>>>>> AV_PIX_FMT_BGRA64LE) ? b_r : r_b) >>>>>>>>> +#define b ((origin == AV_PIX_FMT_BGR48BE || origin == >>>>>>>>> AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == >>>>>>>>> AV_PIX_FMT_BGRA64LE) ? r_b : b_r) >>>>>>>>> + >>>>>>>>> +#if HAVE_VSX >>>>>>>>> + >>>>>>>>> +// This is a SIMD version for IBM POWER8 of function >>>>>>>>> rgb64ToY_c_template >>>>>>>>> +// in file libswscale/input.c >>>>>>>>> +static av_always_inline void >>>>>>>>> +rgb64ToY_c_template_vsx(uint16_t *dst, const uint16_t *src, int >>>>>>>>> width, >>>>>>>>> + enum AVPixelFormat origin, int32_t *rgb2yuv) >>>>>>>>> +{ >>>>>>>>> + int32_t ry = rgb2yuv[RY_IDX], gy = rgb2yuv[GY_IDX], by = >>>>>>>>> rgb2yuv[BY_IDX]; >>>>>>>>> + int i, j; >>>>>>>>> + int num_vec, frag; >>>>>>>>> + >>>>>>>>> + num_vec = width / 8; >>>>>>>>> + frag = width % 8; >>>>>>>>> + >>>>>>>>> + vector int v_ry = vec_splats((int)ry); >>>>>>>>> + vector int v_gy = vec_splats((int)gy); >>>>>>>>> + vector int v_by = vec_splats((int)by); >>>>>>>>> + >>>>>>>>> + int s_opr2; >>>>>>>>> + s_opr2 = (int)(0x2001 << (RGB2YUV_SHIFT-1)); >>>>>>>>> + >>>>>>>>> + vector int v_opr1 = vec_splats((int)RGB2YUV_SHIFT); >>>>>>>>> + vector int v_opr2 = vec_splats((int)s_opr2); >>>>>>>>> + >>>>>>>>> + vector int v_r, v_g, v_b, v_tmp; >>>>>>>>> + vector short v_tmpi, v_dst; >>>>>>>>> + >>>>>>>>> + for (i = 0; i < num_vec; i++) { >>>>>>>>> + for (j = 7; j >= 0 ; j--) { >>>>>>>>> + int r_b = input_pixel(&src[(i*8+j)*4+0]); >>>>>>>>> + int g = input_pixel(&src[(i*8+j)*4+1]); >>>>>>>>> + int b_r = input_pixel(&src[(i*8+j)*4+2]); >>>>>>>>> + >>>>>>>>> + v_r[j % 4] = r; >>>>>>>>> + v_g[j % 4] = g; >>>>>>>>> + v_b[j % 4] = b; >>>>>>>>> + >>>>>>>>> + if (!(j % 4)) { >>>>>>>> ^ >>>>>>>> >>>>>>>>> + v_tmp = v_ry * v_r; >>>>>>>>> + v_tmp = v_tmp + v_gy * v_g; >>>>>>>>> + v_tmp = v_tmp + v_by * v_b; >>>>>>>>> + v_tmp = v_tmp + v_opr2; >>>>>>>>> + v_tmp = vec_sr(v_tmp, (vector unsigned int)v_opr1); >>>>>>>>> + >>>>>>>>> + v_tmpi = (vector short)v_tmp; >>>>>>>>> + v_dst[(j / 4) * 4 + 3] = v_tmpi[6]; >>>>>>>> ^ >>>>>>>> what is the speed of a division and modulo on PPC compared to a >>>>>>>> bitwise and ? >>>>>>>> >>>>>>>> its also not trivial for the compiler to optimize then out as it >>>>>>>> has to proof the varables are never negative >>>>>>>> >>>>>>>> >>>>>>>> [...] >>>>>>>> _______________________________________________ >>>>>>>> ffmpeg-devel mailing list >>>>>>>> ffmpeg-devel@ffmpeg.org >>>>>>>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel >>>>>>> >>>>>>> I don't know the answer to those questions. But, I see your point. >>>>>>> Bitwise operations should be faster than multiplications and divisions. >>>>>>> So, I shall change the code to use bitwise ops and compare the execution >>>>>>> time against its present value (well, average value over multiple runs) >>>>>> >>>>>> it might not make a difference if the compiler does optmize them out >>>>>> but you should know the approximate speed of the instructions >>>>>> that your code is likely/potentially going to use. I mean this is >>>>>> code optimized for PPC. >>>>>> >>>>>> [...] >>>>>> _______________________________________________ >>>>>> ffmpeg-devel mailing list >>>>>> ffmpeg-devel@ffmpeg.org >>>>>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel >>>>> >>>>> I take exception to the tone in that last sentence and I shall respond >>>>> in the same spirit. >>>>> >>>>> 1. I could spend time obtaining the detailed POWER8 micro architecture >>>>> description and compare the execution time of each machine instruction. >>>>> 2. Study gcc source to find out exactly which machine instructions it >>>>> generates for each C language operator >>>>> 3. Use 1 and 2 above to determine which C operators to use here. >>>>> >>>>> OR >>>>> >>>>> I could go ahead and run 2 simulations and compare their average >>>>> execution times. >>>>> >>>>> Seems to me pretty clear which is a better use of time. >>>> >>>> Knowing the execution times of instructions is quite usefull >>>> it certainly takes alot more time to search and read that than to >>>> benchmark once but once you know the timings approximatly you can >>>> roughly guess how fast/slow some code is by just looking. >>>> If knowing that doesnt interrest you then please ignore my comment >>>> to me these things always feel interresting and it was to me certainly >>>> quite usefull when optmizing code to know this kind of stuff for x86 >>>> >>>> also what are you testing on ? >>>> >>>> [...] >>>> _______________________________________________ >>>> ffmpeg-devel mailing list >>>> ffmpeg-devel@ffmpeg.org >>>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel >>> >>> Changing the operations to use bitwise operators instead of >>> multiplications, divisions and modulo arithmetic did not appreciably >>> change execution times. The execution times of the two versions were >>> within 1.5 seconds of each other in all of worst, best and average times >>> reported by command "/usr/bin/time -p make -j 4 fate >>> SAMPLES=fate-suite/" >>> Worst case real components clustered around 310s. Best case times >>> clustered about 296s. >>> >> >>> Do you want me to submit a patch using the bitwise operators to replace >>> the previous patch? >> >> i would slightly prefer that but it doesnt really matter if they are >> the same speed >> >> how much faster is your patch compared to before your patch ? >> >> [...] >> _______________________________________________ >> ffmpeg-devel mailing list >> ffmpeg-devel@ffmpeg.org >> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > The averages over 10 runs for "make -j 4 fate" are: > > Pre-patch: 308.0s > Post-patch: 304.5s
Ideally you'd use the timer.h macros to wrap calls to these functions in order to test them alone without all the overhead, or at least try to use "ffmpeg -benchmark -threads 1 -i INPUT -f null -" or similar to decode one file at a time that uses some or all of these functions. As mentioned before, there are tons of variables that can change the results of a make fate run. I assumed in my previous email that you only haven't tested the difference between micro optimizations like div vs shift, but a more specific test between the plain c versions and the simd ones should ideally be done. > > The difference should be amplified when all of libswscale/input.c is > converted to SIMD. The patch has thus far only converted 9 functions to > SIMD, and there are 29 more yet to be done. > > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel