Re: [FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

2016-04-01 Thread Matthieu Bouron
On Fri, Apr 1, 2016 at 4:15 PM, Matthieu Bouron 
wrote:

> I will push the updated patchset (which takes into account Benoit's
> comments) in about one hour.
>

Pushed.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

2016-04-01 Thread Matthieu Bouron
On Mon, Mar 28, 2016 at 9:12 PM, Matthieu Bouron 
wrote:

> I've managed to update the patchset to keep processing two lines at a time
> for the nv12, nv21 and yuv420p formats; here are the results:
>
> ./ffmpeg_g -nostats -f lavfi -i
> testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f
> null -
>
> BeagleBone Black:
> without patchset: [bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600
> min:0.011513
> with patchset v1: [bench @ 0x1fbb2d0] t:0.012554 avg:0.012751 max:0.034288
> min:0.012523
> with patchset v2: [bench @ 0x10f92d0] t:0.011239 avg:0.011408 max:0.032124
> min:0.011202
>
> Nexus5:
> without patchset: avg: ~2,869ms
> with patchset v1: avg: ~3,008ms
> with patchset v2: avg: ~2,702ms
>
> RPI2:
> without patchset: [bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399
> min:0.020605
> with patchset v1: [bench @ 0xe5f6a0] t:0.018924 avg:0.019075
> max:0.037472 min:0.01884
> with patchset v2: [bench @ 0xc1b6a0] t:0.020999 avg:0.021203 max:0.052184
> min:0.020768
>
> Given these results, I will drop the current patchset and
> submit another one (which keeps processing two lines at a time).
>

I will push the updated patchset (which takes into account Benoit's
comments) in about one hour.

Matthieu


Re: [FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

2016-03-28 Thread Matthieu Bouron
On Sun, Mar 27, 2016 at 5:58 PM, Matthieu Bouron 
wrote:

> I've managed to run the code on a BeagleBone Black board; here are the
> results:
>
> nv12->bgra
> without patchset: [bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600
> min:0.011513
> with patches 01-06/10 applied: [bench @ 0x8052d0] t:0.013438 avg:0.013659
> max:0.034427 min:0.013411
> with patches 01-10/10 applied: [bench @ 0x1fbb2d0] t:0.012554 avg:0.012751
> max:0.034288 min:0.012523
>
> yuv420p->bgra
> without patchset: [bench @ 0x6d42d0] t:0.012954 avg:0.013159 max:0.033866
> min:0.012945
> with patches 01-06/10 applied: [bench @ 0x20172d0] t:0.015154 avg:0.015358
> max:0.036186 min:0.015134
> with patches 01-10/10 applied: [bench @ 0x1d162d0] t:0.014623 avg:0.014784
> max:0.035487 min:0.014568
>
> So it looks like processing one line at a time has a negative effect on
> performance on this board (as opposed to the rpi2). I'll try to keep the
> two-line processing code and post some results (so we can decide which
> version to choose).
>

I've managed to update the patchset to keep processing two lines at a time
for the nv12, nv21 and yuv420p formats; here are the results:

./ffmpeg_g -nostats -f lavfi -i
testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f
null -

BeagleBone Black:
without patchset: [bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600
min:0.011513
with patchset v1: [bench @ 0x1fbb2d0] t:0.012554 avg:0.012751 max:0.034288
min:0.012523
with patchset v2: [bench @ 0x10f92d0] t:0.011239 avg:0.011408 max:0.032124
min:0.011202

Nexus5:
without patchset: avg: ~2,869ms
with patchset v1: avg: ~3,008ms
with patchset v2: avg: ~2,702ms

RPI2:
without patchset: [bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399
min:0.020605
with patchset v1: [bench @ 0xe5f6a0] t:0.018924 avg:0.019075 max:0.037472
min:0.01884
with patchset v2: [bench @ 0xc1b6a0] t:0.020999 avg:0.021203 max:0.052184
min:0.020768

Given these results, I will drop the current patchset and
submit another one (which keeps processing two lines at a time).
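
To make the comparison easier to read, the averages above can be turned into relative changes against the unpatched code. This is only a sketch: the table layout and function names are illustrative, and the Nexus5 figures are read with the comma as a decimal separator, which is an assumption (it does not affect the ratios).

```python
# Average time per conversion, copied from the results above.
# Nexus5 values assume "2,869" means 2.869 (units cancel in the ratios).
AVG = {
    "BeagleBone Black": {"baseline": 0.011743, "v1": 0.012751, "v2": 0.011408},
    "Nexus5": {"baseline": 2.869, "v1": 3.008, "v2": 2.702},
    "RPI2": {"baseline": 0.020813, "v1": 0.019075, "v2": 0.021203},
}


def relative_change(new, old):
    """Percentage change of new versus old; negative means faster."""
    return (new - old) / old * 100.0


def summarize(avg):
    """Map each board to the v1 and v2 changes relative to its baseline."""
    return {
        board: {
            variant: relative_change(times[variant], times["baseline"])
            for variant in ("v1", "v2")
        }
        for board, times in avg.items()
    }


if __name__ == "__main__":
    for board, changes in summarize(AVG).items():
        print(f"{board}: v1 {changes['v1']:+.2f}%  v2 {changes['v2']:+.2f}%")
```

With these numbers, v2 comes out faster than the baseline on the BeagleBone Black and the Nexus5 and slightly slower on the RPI2, while v1 shows the opposite pattern.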

Matthieu


Re: [FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

2016-03-27 Thread Matthieu Bouron
On Fri, Mar 25, 2016 at 11:45 PM, Matthieu Bouron  wrote:

> The following patchset aims to make the yuv->rgba armv7 neon code path
> bitexact with the aarch64 one. It also aims to make the two code bases as
> close as possible.
>
> [PATCH 01/10] swscale/arm/yuv2rgb: remove 32bit code path
>
> The current 32bit code path, which is unused, is removed.
>
> [PATCH 06/10] swscale/arm/yuv2rgb: only process one line at a time
>
> The code processes only one line at a time for the yuv420p, nv12 and nv21
> formats, with no regression in performance observed on an rpi2 (I've even
> observed a slight increase in performance for the nv12 and nv21 formats).
>
> [PATCH 10/10] swscale/arm/yuv2rgb: make the code bitexact with its
>
> The last patch of the series makes the code bitexact with the aarch64
> version. The increase in precision (which introduces a performance loss) is
> compensated by a refactor/optimisation that saves quite a few mov, vdup and
> vqdmulh instructions.
>
> ./ffmpeg_g -nostats -f lavfi -i
> testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f
> null -
>
> without patchset:
> [bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399 min:0.020605
>
> with patchset:
> [bench @ 0xe5f6a0] t:0.018924 avg:0.019075 max:0.037472 min:0.018846


I've managed to run the code on a BeagleBone Black board; here are the
results:

nv12->bgra
without patchset: [bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600
min:0.011513
with patches 01-06/10 applied: [bench @ 0x8052d0] t:0.013438 avg:0.013659
max:0.034427 min:0.013411
with patches 01-10/10 applied: [bench @ 0x1fbb2d0] t:0.012554 avg:0.012751
max:0.034288 min:0.012523

yuv420p->bgra
without patchset: [bench @ 0x6d42d0] t:0.012954 avg:0.013159 max:0.033866
min:0.012945
with patches 01-06/10 applied: [bench @ 0x20172d0] t:0.015154 avg:0.015358
max:0.036186 min:0.015134
with patches 01-10/10 applied: [bench @ 0x1d162d0] t:0.014623 avg:0.014784
max:0.035487 min:0.014568

So it looks like processing one line at a time has a negative effect on
performance on this board (as opposed to the rpi2). I'll try to keep the
two-line processing code and post some results (so we can decide which
version to choose).
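
The thread does not say why handling two lines at a time helps on this board. One plausible contributor, offered here as an assumption, is that in 4:2:0 formats each chroma sample is shared by two luma rows, so a two-line loop can compute the chroma terms once and reuse them. The sketch below models that reuse; the coefficients, the 13-bit shift, and all function names are illustrative and are not taken from the swscale NEON code.

```python
SHIFT = 13  # hypothetical number of fractional bits


def coef(x):
    return round(x * (1 << SHIFT))


# BT.601-style limited-range factors (assumed for illustration).
CY, CRV, CGU, CGV, CBU = coef(1.164), coef(1.596), coef(0.392), coef(0.813), coef(2.017)


def chroma_offsets(u, v):
    """Chroma contribution to R, G and B, shared by every pixel using (u, v)."""
    d, e = u - 128, v - 128
    return CRV * e, -CGU * d - CGV * e, CBU * d


def pixel(y, offsets):
    """Combine one luma sample with precomputed chroma offsets into (r, g, b)."""
    base = CY * (y - 16) + (1 << (SHIFT - 1))
    return tuple(max(0, min(255, (base + o) >> SHIFT)) for o in offsets)


def convert_one_line(luma_row, chroma_row, stats):
    """Convert a single row; chroma terms are recomputed for every row."""
    out = []
    for i, (u, v) in enumerate(chroma_row):
        offsets = chroma_offsets(u, v)
        stats["chroma_evals"] += 1
        out.extend([pixel(luma_row[2 * i], offsets), pixel(luma_row[2 * i + 1], offsets)])
    return out


def convert_two_lines(row0, row1, chroma_row, stats):
    """Convert two rows together; chroma terms are computed once and reused."""
    out0, out1 = [], []
    for i, (u, v) in enumerate(chroma_row):
        offsets = chroma_offsets(u, v)
        stats["chroma_evals"] += 1
        for row, out in ((row0, out0), (row1, out1)):
            out.extend([pixel(row[2 * i], offsets), pixel(row[2 * i + 1], offsets)])
    return out0, out1


if __name__ == "__main__":
    row0, row1 = [16, 235, 100, 50], [200, 30, 128, 90]
    chroma = [(128, 128), (90, 240)]
    single, paired = {"chroma_evals": 0}, {"chroma_evals": 0}
    a = (convert_one_line(row0, chroma, single), convert_one_line(row1, chroma, single))
    b = convert_two_lines(row0, row1, chroma, paired)
    print(a == b, single["chroma_evals"], paired["chroma_evals"])
```

Both variants produce the same pixels; the two-line version simply evaluates the chroma terms half as often. Whether that saving outweighs register pressure or memory access patterns depends on the core, which would be consistent with the different results seen on the rpi2.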

Matthieu


[FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

2016-03-25 Thread Matthieu Bouron
The following patchset aims to make the yuv->rgba armv7 neon code path bitexact
with the aarch64 one. It also aims to make the two code bases as close as
possible.

[PATCH 01/10] swscale/arm/yuv2rgb: remove 32bit code path

The current 32bit code path, which is unused, is removed.

[PATCH 06/10] swscale/arm/yuv2rgb: only process one line at a time

The code processes only one line at a time for the yuv420p, nv12 and nv21
formats, with no regression in performance observed on an rpi2 (I've even
observed a slight increase in performance for the nv12 and nv21 formats).

[PATCH 10/10] swscale/arm/yuv2rgb: make the code bitexact with its

The last patch of the series makes the code bitexact with the aarch64 version.
The increase in precision (which introduces a performance loss) is compensated
by a refactor/optimisation that saves quite a few mov, vdup and vqdmulh
instructions.
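
As a rough illustration of why intermediate precision matters for bit-exactness, the sketch below computes the red channel with two different fixed-point precisions. The BT.601 limited-range factors (1.164 and 1.596) and the 6-bit and 13-bit shifts are assumptions chosen for the example; they are not taken from the swscale NEON code.

```python
def red_fixed_point(y, v, frac_bits):
    """Red channel from limited-range Y and V using frac_bits fractional bits."""
    scale = 1 << frac_bits
    cy = round(1.164 * scale)   # luma gain (assumed BT.601 value)
    cv = round(1.596 * scale)   # V-to-red gain (assumed BT.601 value)
    rounding = scale >> 1       # round to nearest before shifting
    r = (cy * (y - 16) + cv * (v - 128) + rounding) >> frac_bits
    return max(0, min(255, r))  # saturate to 8 bits


def count_mismatches(low_bits, high_bits):
    """Count (y, v) pairs whose red value differs between the two precisions."""
    return sum(
        red_fixed_point(y, v, low_bits) != red_fixed_point(y, v, high_bits)
        for y in range(256)
        for v in range(256)
    )


if __name__ == "__main__":
    # Peak white (Y=235, neutral chroma) already differs between precisions.
    print(red_fixed_point(235, 128, 6), red_fixed_point(235, 128, 13))
    print(count_mismatches(6, 13), "of 65536 (y, v) pairs differ")
```

Two code paths only agree bit for bit when they use the same coefficients, rounding and shift, so matching another implementation can mean carrying more fractional bits than a faster approximation would.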

./ffmpeg_g -nostats -f lavfi -i 
testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f null -

without patchset:
[bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399 min:0.020605

with patchset:
[bench @ 0xe5f6a0] t:0.018924 avg:0.019075 max:0.037472 min:0.018846
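
For anyone repeating the measurement, the two log lines above can be compared with a few lines of Python. This is a minimal sketch: the regular expression is inferred from the output shown here rather than from any documented format, and the helper names are made up for the example.

```python
import re

# Pattern inferred from the bench log lines quoted above (an assumption).
BENCH_RE = re.compile(
    r"\[bench @ 0x[0-9a-f]+\]\s+"
    r"t:(?P<t>[\d.]+)\s+avg:(?P<avg>[\d.]+)\s+"
    r"max:(?P<max>[\d.]+)\s+min:(?P<min>[\d.]+)"
)


def parse_bench(line):
    """Return the t/avg/max/min fields of one bench log line as floats."""
    match = BENCH_RE.search(line)
    if match is None:
        raise ValueError(f"not a bench line: {line!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


def improvement_percent(before, after):
    """Percentage reduction in average time; positive means faster."""
    return (before["avg"] - after["avg"]) / before["avg"] * 100.0


if __name__ == "__main__":
    before = parse_bench("[bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399 min:0.020605")
    after = parse_bench("[bench @ 0xe5f6a0] t:0.018924 avg:0.019075 max:0.037472 min:0.018846")
    print(f"average time reduced by {improvement_percent(before, after):.2f}%")
```

On the figures above this works out to a reduction of roughly 8% in average time on the rpi2.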

Matthieu