Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-04 Thread Ronald S. Bultje
Hi,

On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote:

 This removes the avcodec dependency and make the code almost twice as
 fast. More to come.

 The DCT factorization is based on Fast and numerically stable
 algorithms for discrete cosine transforms from Gerlind Plonkaa 
 Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).


I have no comments on the patch itself, but can you explain why we're
re-implementing a custom f/idct rather than using the one provided in
lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c
would be slower, not faster, so there must be more to this patch than what
I'm getting from it...

Ronald
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-04 Thread Clément Bœsch
On Mon, Aug 04, 2014 at 11:44:43AM -0400, Ronald S. Bultje wrote:
 Hi,
 
 On Mon, Aug 4, 2014 at 11:42 AM, Clément Bœsch u...@pkh.me wrote:
 
  On Mon, Aug 04, 2014 at 11:17:29AM -0400, Ronald S. Bultje wrote:
   Hi,
  
   On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote:
  
This removes the avcodec dependency and make the code almost twice as
fast. More to come.
   
The DCT factorization is based on Fast and numerically stable
algorithms for discrete cosine transforms from Gerlind Plonkaa 
Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
  
  
   I have no comments on the patch itself, but can you explain why we're
   re-implementing a custom f/idct rather than using the one provided in
   lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c
   would be slower, not faster, so there must be more to this patch than
  what
   I'm getting from it...
  
 
  OK so as said in private, I didn't find an accurate (not wrongly JPEG
  like I originally said) 16x16 DCT in libavcodec.
 
  You suggested to use the HEVC or VP9 DCT. That's indeed one solution, but
  we currently have only IDCT for those (AFAIK), and I needed a float
  implementation.
 
 
 You mean forward. idct is inverse, fdct is forward, such that
 idct(fdct(data[][])) =~ data[][].

Yeah sure I meant we have only the IDCT and I also needed the FDCT. The
float implementation was another point.

 You can use the forward transforms
 provided in libvpx (for vp9) or x265 (hevc), they're quite precise, and
 already optimized.

Yeah so basically I would have to maintain that port instead of my
implementation, which doesn't look ideal either (the point of using an
existing code in FFmpeg is that its maintenance would have been shared).

Also, in order to use them, it would make sense to port the whole filter
to fixed point. And unfortunately between the IDCT-average part and the
DCT 3x3 for color [de]correlation, I wasn't really confident to do that
regarding the accuracy (which you definitely wants for a non real-time
denoiser)

 
 Ronald

-- 
Clément B.


pgpDLCQj1O8Gg.pgp
Description: PGP signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-04 Thread Ronald S. Bultje
Hi,

On Mon, Aug 4, 2014 at 11:59 AM, Clément Bœsch u...@pkh.me wrote:

 On Mon, Aug 04, 2014 at 11:44:43AM -0400, Ronald S. Bultje wrote:
  Hi,
 
  On Mon, Aug 4, 2014 at 11:42 AM, Clément Bœsch u...@pkh.me wrote:
 
   On Mon, Aug 04, 2014 at 11:17:29AM -0400, Ronald S. Bultje wrote:
Hi,
   
On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote:
   
 This removes the avcodec dependency and make the code almost twice
 as
 fast. More to come.

 The DCT factorization is based on Fast and numerically stable
 algorithms for discrete cosine transforms from Gerlind Plonkaa 
 Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
   
   
I have no comments on the patch itself, but can you explain why we're
re-implementing a custom f/idct rather than using the one provided in
lavcodec? It seems to me that going from fixedpoint/simd'ed to
 float/c
would be slower, not faster, so there must be more to this patch than
   what
I'm getting from it...
   
  
   OK so as said in private, I didn't find an accurate (not wrongly JPEG
   like I originally said) 16x16 DCT in libavcodec.
  
   You suggested to use the HEVC or VP9 DCT. That's indeed one solution,
 but
   we currently have only IDCT for those (AFAIK), and I needed a float
   implementation.
 
 
  You mean forward. idct is inverse, fdct is forward, such that
  idct(fdct(data[][])) =~ data[][].

 Yeah sure I meant we have only the IDCT and I also needed the FDCT. The
 float implementation was another point.

  You can use the forward transforms
  provided in libvpx (for vp9) or x265 (hevc), they're quite precise, and
  already optimized.

 Yeah so basically I would have to maintain that port instead of my
 implementation, which doesn't look ideal either (the point of using an
 existing code in FFmpeg is that its maintenance would have been shared).


Right, but it means one half (idct) is fully shared, with optimizations
etc. - and only one half is unshared-but-forked-with-optimizations.

Whereas right now, it's fully unshared and unoptimized...

I agree the 3x3 makes it a little more tricky, so do whatever you feel is
right; we don't want to have to convert datatypes 4x just so we can fit an
integer fdct/idct pair between an otherwise full float chain. If the
overall ultimate goal is for everything to be int, it makes sense, but I
don't know what the plan is.

Ronald
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-04 Thread Ronald S. Bultje
Hi,

On Mon, Aug 4, 2014 at 11:42 AM, Clément Bœsch u...@pkh.me wrote:

 On Mon, Aug 04, 2014 at 11:17:29AM -0400, Ronald S. Bultje wrote:
  Hi,
 
  On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote:
 
   This removes the avcodec dependency and make the code almost twice as
   fast. More to come.
  
   The DCT factorization is based on Fast and numerically stable
   algorithms for discrete cosine transforms from Gerlind Plonkaa 
   Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
 
 
  I have no comments on the patch itself, but can you explain why we're
  re-implementing a custom f/idct rather than using the one provided in
  lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c
  would be slower, not faster, so there must be more to this patch than
 what
  I'm getting from it...
 

 OK so as said in private, I didn't find an accurate (not wrongly JPEG
 like I originally said) 16x16 DCT in libavcodec.

 You suggested to use the HEVC or VP9 DCT. That's indeed one solution, but
 we currently have only IDCT for those (AFAIK), and I needed a float
 implementation.


You mean forward. idct is inverse, fdct is forward, such that
idct(fdct(data[][])) =~ data[][]. You can use the forward transforms
provided in libvpx (for vp9) or x265 (hevc), they're quite precise, and
already optimized.

Ronald
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-04 Thread Clément Bœsch
On Sun, Aug 03, 2014 at 01:59:25PM -0700, Timothy Gu wrote:
 On Aug 3, 2014 1:27 PM, Clément Bœsch u...@pkh.me wrote:
 
  This removes the avcodec dependency and make the code almost twice as
  fast. More to come.
 
  The DCT factorization is based on Fast and numerically stable
  algorithms for discrete cosine transforms from Gerlind Plonkaa 
  Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
  ---
   configure |   2 -
   libavfilter/vf_dctdnoiz.c | 328
 +-
   2 files changed, 240 insertions(+), 90 deletions(-)
 
 This change warrants a Changelog entry.
 

Not yet; I'd like to fully optimize it first.

[...]
   static int config_input(AVFilterLink *inlink)
  @@ -194,18 +373,13 @@ static av_cold int init(AVFilterContext *ctx)
   NULL, NULL, NULL, NULL, 0, ctx);
   if (ret  0)
   return ret;
 
  +s-filter_freq_func = filter_freq_expr;
  +} else {
  +s-filter_freq_func = filter_freq_sigma;
   }
 
 Missing if (s-expr)?
 

You mean error checking? That should be handled by ret  0

[...]

-- 
Clément B.


pgpUTTKP9tozl.pgp
Description: PGP signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-03 Thread Clément Bœsch
On Sun, Aug 03, 2014 at 10:27:21PM +0200, Clément Bœsch wrote:
 This removes the avcodec dependency and make the code almost twice as
 fast. More to come.
 
 The DCT factorization is based on Fast and numerically stable
 algorithms for discrete cosine transforms from Gerlind Plonkaa 
 Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
 ---
  configure |   2 -
  libavfilter/vf_dctdnoiz.c | 328 
 +-
  2 files changed, 240 insertions(+), 90 deletions(-)
 

BTW, for the details on the factorization: https://github.com/ubitux/dct

I have a few more things planed to improve the performances; typically,
moving from 16x16 to 8x8 with the previous dct made the code something
like 6 or 7x faster, so I expect a similar gain with the new dct (I'll add
a runtime user option for this).

Also, the way color correction/decorrelation is implemented is kind of
very stupid and thus slow, so I'll fix that as well.

I would like to add tests, but the float are probably a no-go, but we
probably have a threshold system in FATE for this? (FUZZ?).

On a side note, I have an experiment with integer operations which worked
with the previous DCT, so maybe that's a solution for FATE (although I'm
uncertain about the performance gain and the precision loss).

Last thing is the threading. While it will definitely helps for the
simple sigma case, it will be a problem for the eval method. I guess I'll
just ignore threading when that option is selected by the user.

[...]

-- 
Clément B.


pgpD08Yt20ZnK.pgp
Description: PGP signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-03 Thread Timothy Gu
On Aug 3, 2014 1:27 PM, Clément Bœsch u...@pkh.me wrote:

 This removes the avcodec dependency and make the code almost twice as
 fast. More to come.

 The DCT factorization is based on Fast and numerically stable
 algorithms for discrete cosine transforms from Gerlind Plonkaa 
 Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
 ---
  configure |   2 -
  libavfilter/vf_dctdnoiz.c | 328
+-
  2 files changed, 240 insertions(+), 90 deletions(-)

This change warrants a Changelog entry.


 diff --git a/configure b/configure
 index 9c3af50..6196b2a 100755
 --- a/configure
 +++ b/configure
 @@ -2526,8 +2526,6 @@ boxblur_filter_deps=gpl
  bs2b_filter_deps=libbs2b
  colormatrix_filter_deps=gpl
  cropdetect_filter_deps=gpl
 -dctdnoiz_filter_deps=avcodec
 -dctdnoiz_filter_select=dct
  delogo_filter_deps=gpl
  deshake_filter_deps=avcodec
  deshake_filter_select=me_cmp
 diff --git a/libavfilter/vf_dctdnoiz.c b/libavfilter/vf_dctdnoiz.c
 index 71b8536..6d24934 100644
 --- a/libavfilter/vf_dctdnoiz.c
 +++ b/libavfilter/vf_dctdnoiz.c
 @@ -1,5 +1,5 @@
  /*
 - * Copyright (c) 2013 Clément Bœsch
 + * Copyright (c) 2013-2014 Clément Bœsch
   *
   * This file is part of FFmpeg.
   *
 @@ -23,7 +23,6 @@
   * @see http://www.ipol.im/pub/art/2011/ys-dct/
   */

 -#include libavcodec/avfft.h
  #include libavutil/eval.h
  #include libavutil/opt.h
  #include drawutils.h
 @@ -35,7 +34,7 @@
  static const char *const var_names[] = { c, NULL };
  enum { VAR_C, VAR_VARS_NB };

 -typedef struct {
 +typedef struct DCTdnoizContext {
  const AVClass *class;

  /* coefficient factor expression */
 @@ -52,8 +51,9 @@ typedef struct {
  int p_linesize; // line sizes for color and weights
  int overlap;// number of block overlapping pixels
  int step;   // block step increment (BSIZE - overlap)
 -DCTContext *dct, *idct; // DCT and inverse DCT contexts
 -float *block, *tmp_block;   // two BSIZE x BSIZE block buffers
 +void (*filter_freq_func)(struct DCTdnoizContext *s,
 + const float *src, int src_linesize,
 + float *dst, int dst_linesize);
  } DCTdnoizContext;

  #define OFFSET(x) offsetof(DCTdnoizContext, x)
 @@ -69,66 +69,245 @@ static const AVOption dctdnoiz_options[] = {

  AVFILTER_DEFINE_CLASS(dctdnoiz);

 -static float *dct_block(DCTdnoizContext *ctx, const float *src, int
src_linesize)
 +static void av_always_inline fdct16_1d(float *dst, const float *src,
 +   int dst_stridea, int dst_strideb,
 +   int src_stridea, int src_strideb)
  {
 -int x, y;
 -float *column;
 -
 -for (y = 0; y  BSIZE; y++) {
 -float *line = ctx-block;
 +int i;

 -memcpy(line, src, BSIZE * sizeof(*line));
 -src += src_linesize;
 -av_dct_calc(ctx-dct, line);
 -
 -column = ctx-tmp_block + y;
 -column[0] = line[0] * (1. / sqrt(BSIZE));
 -column += BSIZE;
 -for (x = 1; x  BSIZE; x++) {
 -*column = line[x] * sqrt(2. / BSIZE);
 -column += BSIZE;
 -}
 +for (i = 0; i  BSIZE; i++) {
 +const float x0_0 = src[ 0*src_stridea] + src[15*src_stridea];
 +const float x0_1 = src[ 1*src_stridea] + src[14*src_stridea];
 +const float x0_2 = src[ 2*src_stridea] + src[13*src_stridea];
 +const float x0_3 = src[ 3*src_stridea] + src[12*src_stridea];
 +const float x0_4 = src[ 4*src_stridea] + src[11*src_stridea];
 +const float x0_5 = src[ 5*src_stridea] + src[10*src_stridea];
 +const float x0_6 = src[ 6*src_stridea] + src[ 9*src_stridea];
 +const float x0_7 = src[ 7*src_stridea] + src[ 8*src_stridea];
 +const float x0_8 = src[ 0*src_stridea] - src[15*src_stridea];
 +const float x0_9 = src[ 1*src_stridea] - src[14*src_stridea];
 +const float x0_a = src[ 2*src_stridea] - src[13*src_stridea];
 +const float x0_b = src[ 3*src_stridea] - src[12*src_stridea];
 +const float x0_c = src[ 4*src_stridea] - src[11*src_stridea];
 +const float x0_d = src[ 5*src_stridea] - src[10*src_stridea];
 +const float x0_e = src[ 6*src_stridea] - src[ 9*src_stridea];
 +const float x0_f = src[ 7*src_stridea] - src[ 8*src_stridea];
 +const float x2_0 = x0_0 + x0_7;
 +const float x2_1 = x0_1 + x0_6;
 +const float x2_2 = x0_2 + x0_5;
 +const float x2_3 = x0_3 + x0_4;
 +const float x2_4 = x0_0 - x0_7;
 +const float x2_5 = x0_1 - x0_6;
 +const float x2_6 = x0_2 - x0_5;
 +const float x2_7 = x0_3 - x0_4;
 +const float x4_0 = x2_0 + x2_3;
 +const float x4_1 = x2_1 + x2_2;
 +const float x4_2 = x2_0 - x2_3;
 +const float x4_3 = x2_1 - x2_2;
 +const float x5_0 = x4_0 + x4_1;
 +const float x5_1 = x4_0 - x4_1;
 +

Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct

2014-08-03 Thread Michael Niedermayer
On Sun, Aug 03, 2014 at 10:27:21PM +0200, Clément Bœsch wrote:
 This removes the avcodec dependency and make the code almost twice as
 fast. More to come.
 
 The DCT factorization is based on Fast and numerically stable
 algorithms for discrete cosine transforms from Gerlind Plonkaa 
 Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
 ---
  configure |   2 -
  libavfilter/vf_dctdnoiz.c | 328 
 +-
  2 files changed, 240 insertions(+), 90 deletions(-)
 
 diff --git a/configure b/configure
 index 9c3af50..6196b2a 100755
 --- a/configure
 +++ b/configure
 @@ -2526,8 +2526,6 @@ boxblur_filter_deps=gpl
  bs2b_filter_deps=libbs2b
  colormatrix_filter_deps=gpl
  cropdetect_filter_deps=gpl
 -dctdnoiz_filter_deps=avcodec
 -dctdnoiz_filter_select=dct
  delogo_filter_deps=gpl
  deshake_filter_deps=avcodec
  deshake_filter_select=me_cmp
 diff --git a/libavfilter/vf_dctdnoiz.c b/libavfilter/vf_dctdnoiz.c
 index 71b8536..6d24934 100644
 --- a/libavfilter/vf_dctdnoiz.c
 +++ b/libavfilter/vf_dctdnoiz.c
 @@ -1,5 +1,5 @@
  /*
 - * Copyright (c) 2013 Clément Bœsch
 + * Copyright (c) 2013-2014 Clément Bœsch
   *
   * This file is part of FFmpeg.
   *
 @@ -23,7 +23,6 @@
   * @see http://www.ipol.im/pub/art/2011/ys-dct/
   */
  
 -#include libavcodec/avfft.h
  #include libavutil/eval.h
  #include libavutil/opt.h
  #include drawutils.h
 @@ -35,7 +34,7 @@
  static const char *const var_names[] = { c, NULL };
  enum { VAR_C, VAR_VARS_NB };
  
 -typedef struct {
 +typedef struct DCTdnoizContext {
  const AVClass *class;
  
  /* coefficient factor expression */
 @@ -52,8 +51,9 @@ typedef struct {
  int p_linesize; // line sizes for color and weights
  int overlap;// number of block overlapping pixels
  int step;   // block step increment (BSIZE - overlap)
 -DCTContext *dct, *idct; // DCT and inverse DCT contexts
 -float *block, *tmp_block;   // two BSIZE x BSIZE block buffers
 +void (*filter_freq_func)(struct DCTdnoizContext *s,
 + const float *src, int src_linesize,
 + float *dst, int dst_linesize);
  } DCTdnoizContext;
  
  #define OFFSET(x) offsetof(DCTdnoizContext, x)
 @@ -69,66 +69,245 @@ static const AVOption dctdnoiz_options[] = {
  
  AVFILTER_DEFINE_CLASS(dctdnoiz);
  
 -static float *dct_block(DCTdnoizContext *ctx, const float *src, int 
 src_linesize)
 +static void av_always_inline fdct16_1d(float *dst, const float *src,
 +   int dst_stridea, int dst_strideb,
 +   int src_stridea, int src_strideb)
  {
 -int x, y;
 -float *column;
 -
 -for (y = 0; y  BSIZE; y++) {
 -float *line = ctx-block;
 +int i;
  
 -memcpy(line, src, BSIZE * sizeof(*line));
 -src += src_linesize;
 -av_dct_calc(ctx-dct, line);
 -
 -column = ctx-tmp_block + y;
 -column[0] = line[0] * (1. / sqrt(BSIZE));
 -column += BSIZE;
 -for (x = 1; x  BSIZE; x++) {
 -*column = line[x] * sqrt(2. / BSIZE);
 -column += BSIZE;
 -}
 +for (i = 0; i  BSIZE; i++) {
 +const float x0_0 = src[ 0*src_stridea] + src[15*src_stridea];
 +const float x0_1 = src[ 1*src_stridea] + src[14*src_stridea];
 +const float x0_2 = src[ 2*src_stridea] + src[13*src_stridea];
 +const float x0_3 = src[ 3*src_stridea] + src[12*src_stridea];
 +const float x0_4 = src[ 4*src_stridea] + src[11*src_stridea];
 +const float x0_5 = src[ 5*src_stridea] + src[10*src_stridea];
 +const float x0_6 = src[ 6*src_stridea] + src[ 9*src_stridea];
 +const float x0_7 = src[ 7*src_stridea] + src[ 8*src_stridea];
 +const float x0_8 = src[ 0*src_stridea] - src[15*src_stridea];
 +const float x0_9 = src[ 1*src_stridea] - src[14*src_stridea];
 +const float x0_a = src[ 2*src_stridea] - src[13*src_stridea];
 +const float x0_b = src[ 3*src_stridea] - src[12*src_stridea];
 +const float x0_c = src[ 4*src_stridea] - src[11*src_stridea];
 +const float x0_d = src[ 5*src_stridea] - src[10*src_stridea];
 +const float x0_e = src[ 6*src_stridea] - src[ 9*src_stridea];
 +const float x0_f = src[ 7*src_stridea] - src[ 8*src_stridea];
 +const float x2_0 = x0_0 + x0_7;
 +const float x2_1 = x0_1 + x0_6;
 +const float x2_2 = x0_2 + x0_5;
 +const float x2_3 = x0_3 + x0_4;
 +const float x2_4 = x0_0 - x0_7;
 +const float x2_5 = x0_1 - x0_6;
 +const float x2_6 = x0_2 - x0_5;
 +const float x2_7 = x0_3 - x0_4;
 +const float x4_0 = x2_0 + x2_3;
 +const float x4_1 = x2_1 + x2_2;
 +const float x4_2 = x2_0 - x2_3;
 +const float x4_3 = x2_1 - x2_2;
 +const float x5_0 = x4_0 + x4_1;
 +const float x5_1 = x4_0 - x4_1;
 +const float x5_2