Re: [FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-13 Thread Christophe Gisquet
Hi,

2015-10-13 2:26 GMT+02:00 Michael Niedermayer :
> On Mon, Oct 12, 2015 at 07:37:46PM +0200, Christophe Gisquet wrote:
>> When the input of a pass has 15 or 16 bits of precision (in particular
>> the column pass), the addition of a bias to W4 may lead to overflows
>> in the input to pmaddwd.
>>
>> This requires postponing the adding of the bias to after the first
>> butterfly. To do so, the fact that m15, unused although zeroed, is
>> exploited. In case the pass is safe, an address can be directly used,
>> and the number of xmm regs can be decreased. Otherwise, the 32bits bias
>> is loaded into it.
>> ---
>>  libavcodec/x86/proresdsp.asm  |  8 
>>  libavcodec/x86/simple_idct10_template.asm | 13 -
>>  2 files changed, 16 insertions(+), 5 deletions(-)
>
> how can i reproduce these overflows ?

Generate the vsynth3-dnxhd-1080i-10bit.mov added after another patch.

Decode it first using faani (you could miss the error).

Now, for the parameters that fail. You know how
(1<<(%pass_bitdepth-1))/W4 is added to the first butterfly. The macro
allows to pass the right pw_ to it (essentially times 4 dw
1<<(%pass_bitdepth-1-14)), or "" and expects to find a
pd_round_%pass_bitdepth (essentially times 4 dd
1<<(%pass_bitdepth-1)). This is indicated in the comments of the
template: "Adding 1<<(%2-1) for >=15 bits values".

Contrast:
"", 13, pw_8, 18, 0, pw_1023 => stddev:1.33 PSNR: 45.61 MAXDIFF:  255
"", 12, pw_16, 19, 0, pw_1023 => stddev:0.33 PSNR: 57.61 MAXDIFF:  255
to the result of the current parameters (no difference)

The same input doesn't cause issue to prores, for some reason,
probably because the mean DC (through times 4 dw 0x2008) is added at
the last pass.

-- 
Christophe
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-13 Thread Michael Niedermayer
On Tue, Oct 13, 2015 at 01:33:07PM +0200, Christophe Gisquet wrote:
> Hi,
> 
> 2015-10-13 13:10 GMT+02:00 Michael Niedermayer :
> > hmm, iam a bit concerned that adding the rounder (which effectively is
> > 0.5) causes a overflow, that would if iam not mistaken imlpy that
> > things are very close to overflowing already without it
> 
> It's true, but the immediate cause here is the input coeffs
> overflowing (ie, once the rounder is added, a positive value is seen
> as a negative one to pmaddwd).

have you tried using saturating additions for the rounder ?


> 
> Otherwise, your concern is basically whether 32bits arithmetics is
> enough: then simple_idct may also overflows on x86_32.

[...]

-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-13 Thread Michael Niedermayer
On Tue, Oct 13, 2015 at 09:01:44AM +0200, Christophe Gisquet wrote:
> Hi,
> 
> 2015-10-13 2:26 GMT+02:00 Michael Niedermayer :
> > On Mon, Oct 12, 2015 at 07:37:46PM +0200, Christophe Gisquet wrote:
> >> When the input of a pass has 15 or 16 bits of precision (in particular
> >> the column pass), the addition of a bias to W4 may lead to overflows
> >> in the input to pmaddwd.
> >>
> >> This requires postponing the adding of the bias to after the first
> >> butterfly. To do so, the fact that m15, unused although zeroed, is
> >> exploited. In case the pass is safe, an address can be directly used,
> >> and the number of xmm regs can be decreased. Otherwise, the 32bits bias
> >> is loaded into it.
> >> ---
> >>  libavcodec/x86/proresdsp.asm  |  8 
> >>  libavcodec/x86/simple_idct10_template.asm | 13 -
> >>  2 files changed, 16 insertions(+), 5 deletions(-)
> >
> > how can i reproduce these overflows ?
> 
> Generate the vsynth3-dnxhd-1080i-10bit.mov added after another patch.
> 
> Decode it first using faani (you could miss the error).
> 
> Now, for the parameters that fail. You know how
> (1<<(%pass_bitdepth-1))/W4 is added to the first butterfly. The macro
> allows to pass the right pw_ to it (essentially times 4 dw
> 1<<(%pass_bitdepth-1-14)), or "" and expects to find a
> pd_round_%pass_bitdepth (essentially times 4 dd
> 1<<(%pass_bitdepth-1)). This is indicated in the comments of the
> template: "Adding 1<<(%2-1) for >=15 bits values".

hmm, iam a bit concerned that adding the rounder (which effectively is
0.5) causes a overflow, that would if iam not mistaken imlpy that
things are very close to overflowing already without it

but either way this patch is needed for the 10bit IDCT code
so applied

thanks

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I do not agree with what you have to say, but I'll defend to the death your
right to say it. -- Voltaire


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-13 Thread Christophe Gisquet
Hi,

2015-10-13 13:10 GMT+02:00 Michael Niedermayer :
> hmm, iam a bit concerned that adding the rounder (which effectively is
> 0.5) causes a overflow, that would if iam not mistaken imlpy that
> things are very close to overflowing already without it

It's true, but the immediate cause here is the input coeffs
overflowing (ie, once the rounder is added, a positive value is seen
as a negative one to pmaddwd).

Otherwise, your concern is basically whether 32bits arithmetics is
enough: then simple_idct may also overflows on x86_32.

-- 
Christophe
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-13 Thread Christophe Gisquet
2015-10-13 15:43 GMT+02:00 Michael Niedermayer :
> On Tue, Oct 13, 2015 at 01:33:07PM +0200, Christophe Gisquet wrote:
>> Hi,
>>
>> 2015-10-13 13:10 GMT+02:00 Michael Niedermayer :
>> > hmm, iam a bit concerned that adding the rounder (which effectively is
>> > 0.5) causes a overflow, that would if iam not mistaken imlpy that
>> > things are very close to overflowing already without it
>>
>> It's true, but the immediate cause here is the input coeffs
>> overflowing (ie, once the rounder is added, a positive value is seen
>> as a negative one to pmaddwd).
>
> have you tried using saturating additions for the rounder ?

This would prevent an absolutely wrong result due to overflow, but
would not be sufficient for bitexactness.

The cases where the bitexactness issue arise may be pathological, not
appearing in fate, though.

But I'll check.

-- 
Christophe
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-12 Thread Christophe Gisquet
When the input of a pass has 15 or 16 bits of precision (in particular
the column pass), the addition of a bias to W4 may lead to overflows
in the input to pmaddwd.

This requires postponing the adding of the bias to after the first
butterfly. To do so, the fact that m15, unused although zeroed, is
exploited. In case the pass is safe, an address can be directly used,
and the number of xmm regs can be decreased. Otherwise, the 32bits bias
is loaded into it.
---
 libavcodec/x86/proresdsp.asm  |  8 
 libavcodec/x86/simple_idct10_template.asm | 13 -
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/libavcodec/x86/proresdsp.asm b/libavcodec/x86/proresdsp.asm
index 18cf15b..3fb71ba 100644
--- a/libavcodec/x86/proresdsp.asm
+++ b/libavcodec/x86/proresdsp.asm
@@ -37,17 +37,17 @@ cextern pw_1019
 
 section .text align=16
 
-%macro idct_put_fn 1
-cglobal prores_idct_put_10, 4, 4, %1
+%macro idct_put_fn 0
+cglobal prores_idct_put_10, 4, 4, 15
 IDCT_PUT_FNpw_1, 15, pw_88, 18, pw_4, pw_1019, r3
 RET
 %endmacro
 
 INIT_XMM sse2
-idct_put_fn 16
+idct_put_fn
 %if HAVE_AVX_EXTERNAL
 INIT_XMM avx
-idct_put_fn 16
+idct_put_fn
 %endif
 
 %endif
diff --git a/libavcodec/x86/simple_idct10_template.asm 
b/libavcodec/x86/simple_idct10_template.asm
index 968d280..e46c83f 100644
--- a/libavcodec/x86/simple_idct10_template.asm
+++ b/libavcodec/x86/simple_idct10_template.asm
@@ -75,6 +75,7 @@ cextern w7_min_w5
 ; a2 -= W6 * row[2];
 ; a3 -= W2 * row[2];
 %ifstr %1
+movam15, [pd_round_ %+ %2]
 %else
 paddw   m10, [%1]
 %endif
@@ -87,6 +88,17 @@ cextern w7_min_w5
 pmaddwd m7,  m1, [w4_min_w2]
 pmaddwd m0, [w4_plus_w2]
 pmaddwd m1, [w4_plus_w2]
+%ifstr %1
+; Adding 1<<(%2-1) for >=15 bits values
+paddd   m2, m15
+paddd   m3, m15
+paddd   m4, m15
+paddd   m5, m15
+paddd   m6, m15
+paddd   m7, m15
+paddd   m0, m15
+paddd   m1, m15
+%endif
 
 ; a0: -1*row[0]-1*row[2]
 ; a1: -1*row[0]
@@ -225,7 +237,6 @@ cextern w7_min_w5
 
 %macro IDCT_PUT_FN 6-7
 movsxd  r1,  r1d
-pxorm15, m15   ; zero
 
 ; for (i = 0; i < 8; i++)
 ; idctRowCondDC(block + i*8);
-- 
2.6.0

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 5/9] x86: simple_idct10_template: fix overflow in pass

2015-10-12 Thread Michael Niedermayer
On Mon, Oct 12, 2015 at 07:37:46PM +0200, Christophe Gisquet wrote:
> When the input of a pass has 15 or 16 bits of precision (in particular
> the column pass), the addition of a bias to W4 may lead to overflows
> in the input to pmaddwd.
> 
> This requires postponing the adding of the bias to after the first
> butterfly. To do so, the fact that m15, unused although zeroed, is
> exploited. In case the pass is safe, an address can be directly used,
> and the number of xmm regs can be decreased. Otherwise, the 32bits bias
> is loaded into it.
> ---
>  libavcodec/x86/proresdsp.asm  |  8 
>  libavcodec/x86/simple_idct10_template.asm | 13 -
>  2 files changed, 16 insertions(+), 5 deletions(-)

how can i reproduce these overflows ?

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Opposition brings concord. Out of discord comes the fairest harmony.
-- Heraclitus


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel