Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
Hi, On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). I have no comments on the patch itself, but can you explain why we're re-implementing a custom f/idct rather than using the one provided in lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c would be slower, not faster, so there must be more to this patch than what I'm getting from it... Ronald ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
On Mon, Aug 04, 2014 at 11:44:43AM -0400, Ronald S. Bultje wrote: Hi, On Mon, Aug 4, 2014 at 11:42 AM, Clément Bœsch u...@pkh.me wrote: On Mon, Aug 04, 2014 at 11:17:29AM -0400, Ronald S. Bultje wrote: Hi, On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). I have no comments on the patch itself, but can you explain why we're re-implementing a custom f/idct rather than using the one provided in lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c would be slower, not faster, so there must be more to this patch than what I'm getting from it... OK so as said in private, I didn't find an accurate (not wrongly JPEG like I originally said) 16x16 DCT in libavcodec. You suggested to use the HEVC or VP9 DCT. That's indeed one solution, but we currently have only IDCT for those (AFAIK), and I needed a float implementation. You mean forward. idct is inverse, fdct is forward, such that idct(fdct(data[][])) =~ data[][]. Yeah sure I meant we have only the IDCT and I also needed the FDCT. The float implementation was another point. You can use the forward transforms provided in libvpx (for vp9) or x265 (hevc), they're quite precise, and already optimized. Yeah so basically I would have to maintain that port instead of my implementation, which doesn't look ideal either (the point of using an existing code in FFmpeg is that its maintenance would have been shared). Also, in order to use them, it would make sense to port the whole filter to fixed point. And unfortunately between the IDCT-average part and the DCT 3x3 for color [de]correlation, I wasn't really confident to do that regarding the accuracy (which you definitely wants for a non real-time denoiser) Ronald -- Clément B. pgpDLCQj1O8Gg.pgp Description: PGP signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
Hi, On Mon, Aug 4, 2014 at 11:59 AM, Clément Bœsch u...@pkh.me wrote: On Mon, Aug 04, 2014 at 11:44:43AM -0400, Ronald S. Bultje wrote: Hi, On Mon, Aug 4, 2014 at 11:42 AM, Clément Bœsch u...@pkh.me wrote: On Mon, Aug 04, 2014 at 11:17:29AM -0400, Ronald S. Bultje wrote: Hi, On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). I have no comments on the patch itself, but can you explain why we're re-implementing a custom f/idct rather than using the one provided in lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c would be slower, not faster, so there must be more to this patch than what I'm getting from it... OK so as said in private, I didn't find an accurate (not wrongly JPEG like I originally said) 16x16 DCT in libavcodec. You suggested to use the HEVC or VP9 DCT. That's indeed one solution, but we currently have only IDCT for those (AFAIK), and I needed a float implementation. You mean forward. idct is inverse, fdct is forward, such that idct(fdct(data[][])) =~ data[][]. Yeah sure I meant we have only the IDCT and I also needed the FDCT. The float implementation was another point. You can use the forward transforms provided in libvpx (for vp9) or x265 (hevc), they're quite precise, and already optimized. Yeah so basically I would have to maintain that port instead of my implementation, which doesn't look ideal either (the point of using an existing code in FFmpeg is that its maintenance would have been shared). Right, but it means one half (idct) is fully shared, with optimizations etc. - and only one half is unshared-but-forked-with-optimizations. Whereas right now, it's fully unshared and unoptimized... I agree the 3x3 makes it a little more tricky, so do whatever you feel is right; we don't want to have to convert datatypes 4x just so we can fit an integer fdct/idct pair between an otherwise full float chain. If the overall ultimate goal is for everything to be int, it makes sense, but I don't know what the plan is. Ronald ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
Hi, On Mon, Aug 4, 2014 at 11:42 AM, Clément Bœsch u...@pkh.me wrote: On Mon, Aug 04, 2014 at 11:17:29AM -0400, Ronald S. Bultje wrote: Hi, On Sun, Aug 3, 2014 at 4:27 PM, Clément Bœsch u...@pkh.me wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). I have no comments on the patch itself, but can you explain why we're re-implementing a custom f/idct rather than using the one provided in lavcodec? It seems to me that going from fixedpoint/simd'ed to float/c would be slower, not faster, so there must be more to this patch than what I'm getting from it... OK so as said in private, I didn't find an accurate (not wrongly JPEG like I originally said) 16x16 DCT in libavcodec. You suggested to use the HEVC or VP9 DCT. That's indeed one solution, but we currently have only IDCT for those (AFAIK), and I needed a float implementation. You mean forward. idct is inverse, fdct is forward, such that idct(fdct(data[][])) =~ data[][]. You can use the forward transforms provided in libvpx (for vp9) or x265 (hevc), they're quite precise, and already optimized. Ronald ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
On Sun, Aug 03, 2014 at 01:59:25PM -0700, Timothy Gu wrote: On Aug 3, 2014 1:27 PM, Clément Bœsch u...@pkh.me wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). --- configure | 2 - libavfilter/vf_dctdnoiz.c | 328 +- 2 files changed, 240 insertions(+), 90 deletions(-) This change warrants a Changelog entry. Not yet; I'd like to fully optimize it first. [...] static int config_input(AVFilterLink *inlink) @@ -194,18 +373,13 @@ static av_cold int init(AVFilterContext *ctx) NULL, NULL, NULL, NULL, 0, ctx); if (ret 0) return ret; +s-filter_freq_func = filter_freq_expr; +} else { +s-filter_freq_func = filter_freq_sigma; } Missing if (s-expr)? You mean error checking? That should be handled by ret 0 [...] -- Clément B. pgpUTTKP9tozl.pgp Description: PGP signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
On Sun, Aug 03, 2014 at 10:27:21PM +0200, Clément Bœsch wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). --- configure | 2 - libavfilter/vf_dctdnoiz.c | 328 +- 2 files changed, 240 insertions(+), 90 deletions(-) BTW, for the details on the factorization: https://github.com/ubitux/dct I have a few more things planed to improve the performances; typically, moving from 16x16 to 8x8 with the previous dct made the code something like 6 or 7x faster, so I expect a similar gain with the new dct (I'll add a runtime user option for this). Also, the way color correction/decorrelation is implemented is kind of very stupid and thus slow, so I'll fix that as well. I would like to add tests, but the float are probably a no-go, but we probably have a threshold system in FATE for this? (FUZZ?). On a side note, I have an experiment with integer operations which worked with the previous DCT, so maybe that's a solution for FATE (although I'm uncertain about the performance gain and the precision loss). Last thing is the threading. While it will definitely helps for the simple sigma case, it will be a problem for the eval method. I guess I'll just ignore threading when that option is selected by the user. [...] -- Clément B. pgpD08Yt20ZnK.pgp Description: PGP signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
On Aug 3, 2014 1:27 PM, Clément Bœsch u...@pkh.me wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). --- configure | 2 - libavfilter/vf_dctdnoiz.c | 328 +- 2 files changed, 240 insertions(+), 90 deletions(-) This change warrants a Changelog entry. diff --git a/configure b/configure index 9c3af50..6196b2a 100755 --- a/configure +++ b/configure @@ -2526,8 +2526,6 @@ boxblur_filter_deps=gpl bs2b_filter_deps=libbs2b colormatrix_filter_deps=gpl cropdetect_filter_deps=gpl -dctdnoiz_filter_deps=avcodec -dctdnoiz_filter_select=dct delogo_filter_deps=gpl deshake_filter_deps=avcodec deshake_filter_select=me_cmp diff --git a/libavfilter/vf_dctdnoiz.c b/libavfilter/vf_dctdnoiz.c index 71b8536..6d24934 100644 --- a/libavfilter/vf_dctdnoiz.c +++ b/libavfilter/vf_dctdnoiz.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2013 Clément Bœsch + * Copyright (c) 2013-2014 Clément Bœsch * * This file is part of FFmpeg. * @@ -23,7 +23,6 @@ * @see http://www.ipol.im/pub/art/2011/ys-dct/ */ -#include libavcodec/avfft.h #include libavutil/eval.h #include libavutil/opt.h #include drawutils.h @@ -35,7 +34,7 @@ static const char *const var_names[] = { c, NULL }; enum { VAR_C, VAR_VARS_NB }; -typedef struct { +typedef struct DCTdnoizContext { const AVClass *class; /* coefficient factor expression */ @@ -52,8 +51,9 @@ typedef struct { int p_linesize; // line sizes for color and weights int overlap;// number of block overlapping pixels int step; // block step increment (BSIZE - overlap) -DCTContext *dct, *idct; // DCT and inverse DCT contexts -float *block, *tmp_block; // two BSIZE x BSIZE block buffers +void (*filter_freq_func)(struct DCTdnoizContext *s, + const float *src, int src_linesize, + float *dst, int dst_linesize); } DCTdnoizContext; #define OFFSET(x) offsetof(DCTdnoizContext, x) @@ -69,66 +69,245 @@ static const AVOption dctdnoiz_options[] = { AVFILTER_DEFINE_CLASS(dctdnoiz); -static float *dct_block(DCTdnoizContext *ctx, const float *src, int src_linesize) +static void av_always_inline fdct16_1d(float *dst, const float *src, + int dst_stridea, int dst_strideb, + int src_stridea, int src_strideb) { -int x, y; -float *column; - -for (y = 0; y BSIZE; y++) { -float *line = ctx-block; +int i; -memcpy(line, src, BSIZE * sizeof(*line)); -src += src_linesize; -av_dct_calc(ctx-dct, line); - -column = ctx-tmp_block + y; -column[0] = line[0] * (1. / sqrt(BSIZE)); -column += BSIZE; -for (x = 1; x BSIZE; x++) { -*column = line[x] * sqrt(2. / BSIZE); -column += BSIZE; -} +for (i = 0; i BSIZE; i++) { +const float x0_0 = src[ 0*src_stridea] + src[15*src_stridea]; +const float x0_1 = src[ 1*src_stridea] + src[14*src_stridea]; +const float x0_2 = src[ 2*src_stridea] + src[13*src_stridea]; +const float x0_3 = src[ 3*src_stridea] + src[12*src_stridea]; +const float x0_4 = src[ 4*src_stridea] + src[11*src_stridea]; +const float x0_5 = src[ 5*src_stridea] + src[10*src_stridea]; +const float x0_6 = src[ 6*src_stridea] + src[ 9*src_stridea]; +const float x0_7 = src[ 7*src_stridea] + src[ 8*src_stridea]; +const float x0_8 = src[ 0*src_stridea] - src[15*src_stridea]; +const float x0_9 = src[ 1*src_stridea] - src[14*src_stridea]; +const float x0_a = src[ 2*src_stridea] - src[13*src_stridea]; +const float x0_b = src[ 3*src_stridea] - src[12*src_stridea]; +const float x0_c = src[ 4*src_stridea] - src[11*src_stridea]; +const float x0_d = src[ 5*src_stridea] - src[10*src_stridea]; +const float x0_e = src[ 6*src_stridea] - src[ 9*src_stridea]; +const float x0_f = src[ 7*src_stridea] - src[ 8*src_stridea]; +const float x2_0 = x0_0 + x0_7; +const float x2_1 = x0_1 + x0_6; +const float x2_2 = x0_2 + x0_5; +const float x2_3 = x0_3 + x0_4; +const float x2_4 = x0_0 - x0_7; +const float x2_5 = x0_1 - x0_6; +const float x2_6 = x0_2 - x0_5; +const float x2_7 = x0_3 - x0_4; +const float x4_0 = x2_0 + x2_3; +const float x4_1 = x2_1 + x2_2; +const float x4_2 = x2_0 - x2_3; +const float x4_3 = x2_1 - x2_2; +const float x5_0 = x4_0 + x4_1; +const float x5_1 = x4_0 - x4_1; +
Re: [FFmpeg-devel] [PATCH] avfilter/dctdnoiz: rewrite [f/i]dct
On Sun, Aug 03, 2014 at 10:27:21PM +0200, Clément Bœsch wrote: This removes the avcodec dependency and make the code almost twice as fast. More to come. The DCT factorization is based on Fast and numerically stable algorithms for discrete cosine transforms from Gerlind Plonkaa Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015). --- configure | 2 - libavfilter/vf_dctdnoiz.c | 328 +- 2 files changed, 240 insertions(+), 90 deletions(-) diff --git a/configure b/configure index 9c3af50..6196b2a 100755 --- a/configure +++ b/configure @@ -2526,8 +2526,6 @@ boxblur_filter_deps=gpl bs2b_filter_deps=libbs2b colormatrix_filter_deps=gpl cropdetect_filter_deps=gpl -dctdnoiz_filter_deps=avcodec -dctdnoiz_filter_select=dct delogo_filter_deps=gpl deshake_filter_deps=avcodec deshake_filter_select=me_cmp diff --git a/libavfilter/vf_dctdnoiz.c b/libavfilter/vf_dctdnoiz.c index 71b8536..6d24934 100644 --- a/libavfilter/vf_dctdnoiz.c +++ b/libavfilter/vf_dctdnoiz.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2013 Clément Bœsch + * Copyright (c) 2013-2014 Clément Bœsch * * This file is part of FFmpeg. * @@ -23,7 +23,6 @@ * @see http://www.ipol.im/pub/art/2011/ys-dct/ */ -#include libavcodec/avfft.h #include libavutil/eval.h #include libavutil/opt.h #include drawutils.h @@ -35,7 +34,7 @@ static const char *const var_names[] = { c, NULL }; enum { VAR_C, VAR_VARS_NB }; -typedef struct { +typedef struct DCTdnoizContext { const AVClass *class; /* coefficient factor expression */ @@ -52,8 +51,9 @@ typedef struct { int p_linesize; // line sizes for color and weights int overlap;// number of block overlapping pixels int step; // block step increment (BSIZE - overlap) -DCTContext *dct, *idct; // DCT and inverse DCT contexts -float *block, *tmp_block; // two BSIZE x BSIZE block buffers +void (*filter_freq_func)(struct DCTdnoizContext *s, + const float *src, int src_linesize, + float *dst, int dst_linesize); } DCTdnoizContext; #define OFFSET(x) offsetof(DCTdnoizContext, x) @@ -69,66 +69,245 @@ static const AVOption dctdnoiz_options[] = { AVFILTER_DEFINE_CLASS(dctdnoiz); -static float *dct_block(DCTdnoizContext *ctx, const float *src, int src_linesize) +static void av_always_inline fdct16_1d(float *dst, const float *src, + int dst_stridea, int dst_strideb, + int src_stridea, int src_strideb) { -int x, y; -float *column; - -for (y = 0; y BSIZE; y++) { -float *line = ctx-block; +int i; -memcpy(line, src, BSIZE * sizeof(*line)); -src += src_linesize; -av_dct_calc(ctx-dct, line); - -column = ctx-tmp_block + y; -column[0] = line[0] * (1. / sqrt(BSIZE)); -column += BSIZE; -for (x = 1; x BSIZE; x++) { -*column = line[x] * sqrt(2. / BSIZE); -column += BSIZE; -} +for (i = 0; i BSIZE; i++) { +const float x0_0 = src[ 0*src_stridea] + src[15*src_stridea]; +const float x0_1 = src[ 1*src_stridea] + src[14*src_stridea]; +const float x0_2 = src[ 2*src_stridea] + src[13*src_stridea]; +const float x0_3 = src[ 3*src_stridea] + src[12*src_stridea]; +const float x0_4 = src[ 4*src_stridea] + src[11*src_stridea]; +const float x0_5 = src[ 5*src_stridea] + src[10*src_stridea]; +const float x0_6 = src[ 6*src_stridea] + src[ 9*src_stridea]; +const float x0_7 = src[ 7*src_stridea] + src[ 8*src_stridea]; +const float x0_8 = src[ 0*src_stridea] - src[15*src_stridea]; +const float x0_9 = src[ 1*src_stridea] - src[14*src_stridea]; +const float x0_a = src[ 2*src_stridea] - src[13*src_stridea]; +const float x0_b = src[ 3*src_stridea] - src[12*src_stridea]; +const float x0_c = src[ 4*src_stridea] - src[11*src_stridea]; +const float x0_d = src[ 5*src_stridea] - src[10*src_stridea]; +const float x0_e = src[ 6*src_stridea] - src[ 9*src_stridea]; +const float x0_f = src[ 7*src_stridea] - src[ 8*src_stridea]; +const float x2_0 = x0_0 + x0_7; +const float x2_1 = x0_1 + x0_6; +const float x2_2 = x0_2 + x0_5; +const float x2_3 = x0_3 + x0_4; +const float x2_4 = x0_0 - x0_7; +const float x2_5 = x0_1 - x0_6; +const float x2_6 = x0_2 - x0_5; +const float x2_7 = x0_3 - x0_4; +const float x4_0 = x2_0 + x2_3; +const float x4_1 = x2_1 + x2_2; +const float x4_2 = x2_0 - x2_3; +const float x4_3 = x2_1 - x2_2; +const float x5_0 = x4_0 + x4_1; +const float x5_1 = x4_0 - x4_1; +const float x5_2