Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-20 Thread Mark Thompson
On 20/02/2019 10:17, Wang, Shaofei wrote:
>> -Original Message-
>> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of
>> Mark Thompson
>> Sent: Saturday, February 16, 2019 8:12 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
>> decode + N filter graphs and adaptive bitrate.
>> On 15/02/2019 21:54, Shaofei Wang wrote:
>>> It enables multiple filter graphs to run concurrently, which brings
>>> roughly 4%~20% improvement in some 1:N scenarios with CPU or GPU
>>> acceleration.
>>>
>>> Below are some test cases and comparison as reference.
>>> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
>>> (Software: Intel iHD driver - 16.9.00100, CentOS 7)
>>>
>>> For 1:N transcode by GPU acceleration with vaapi:
>>> ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
>>> -hwaccel_output_format vaapi \
>>> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>> -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
>>> -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
>>>
>>> test results:
>>> 2 encoders 5 encoders 10 encoders
>>> Improved   6.1%6.9%   5.5%
>>>
>>> For 1:N transcode by GPU acceleration with QSV:
>>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>>> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>> -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null
>> \
>>> -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null
>>> /dev/null
>>>
>>> test results:
>>> 2 encoders  5 encoders 10 encoders
>>> Improved   6%   4% 15%
>>>
>>> For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
>>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>>> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>> -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f
>> null /dev/null \
>>> -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f
>>> null /dev/null
>>>
>>> test results:
>>> 2 scale  5 scale   10 scale
>>> Improved   12% 21%21%
>>>
>>> For CPU only 1 decode to N scaling:
>>> ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>> -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
>>> -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
>>>
>>> test results:
>>> 2 scale  5 scale   10 scale
>>> Improved   25%107%   148%
>>>
>>> Signed-off-by: Wang, Shaofei 
>>> Reviewed-by: Zhao, Jun 
>>> ---
>>>  fftools/ffmpeg.c| 121
>>>  fftools/ffmpeg.h|  14 ++
>>>  fftools/ffmpeg_filter.c |   1 +
>>>  3 files changed, 128 insertions(+), 8 deletions(-)
>>
>> On a bit more review, I don't think this patch works at all.
>>
> It has been tested and verified with many cases. More fate cases need to
> be covered now.
> 
>> The existing code is all written to be run serially.  This simplistic 
>> approach to
>> parallelising it falls down because many of those functions use variables
>> written in what were previously other functions called at different times but
>> have now become other threads, introducing undefined behaviour due to
>> data races.
>>
> Actually, this patch does not try to parallelise everything in ffmpeg. It
> only threads the input filter of each filter graph (intended for simple
> filter graphs), which is a simple way to improve the performance of N
> filter graphs without introducing huge modifications. A lot of code still
> runs serially; the differences are that each filter graph initialises its
> own output streams instead of all being initialised together, and each
> filter graph reaps the filters in its own chain.

Indeed the existing encapsulation tries to keep things mostly separate, but in 
various places it accesses shared state which works fine in the serial case but 
fails when those parts are run in parallel.

Data races are undefined behaviour in C; introducing them is not acceptable.

>> To consider a single example (not the only one), the function
>> check_init_output_file() does not work at all after this change.  The test 
>> for
>> OutputStream initialisation (so that you run exactly once after all of the
>> output streams are ready) races with other threads setting those variables.
>> Since that's undefined behaviour you may get lucky sometimes and have the
>> output file initialisation run exactly once, but in general it will fail in 
>> unknown
>> ways.
>>
> 
> The check_init_output_file() function should be responsible only for the
> output file associated with the specific output stream managed by each
> thread chain; that means even when it is called from different threads,
> the data being set is different. Let me double-check.

Each output file can contain multiple streams - try a transcode with both audio 
and video filters.

(Incidentally, the patch as-is also crashes in this case if the transcode completes

Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-20 Thread Wang, Shaofei
> -Original Message-
> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of
> Mark Thompson
> Sent: Saturday, February 16, 2019 8:12 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
> decode + N filter graphs and adaptive bitrate.
> On 15/02/2019 21:54, Shaofei Wang wrote:
> > It enables multiple filter graphs to run concurrently, which brings
> > roughly 4%~20% improvement in some 1:N scenarios with CPU or GPU
> > acceleration.
> >
> > Below are some test cases and comparison as reference.
> > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> > (Software: Intel iHD driver - 16.9.00100, CentOS 7)
> >
> > For 1:N transcode by GPU acceleration with vaapi:
> > ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
> > -hwaccel_output_format vaapi \
> > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> > -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
> > -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
> >
> > test results:
> > 2 encoders 5 encoders 10 encoders
> > Improved   6.1%6.9%   5.5%
> >
> > For 1:N transcode by GPU acceleration with QSV:
> > ./ffmpeg -hwaccel qsv -c:v h264_qsv \
> > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> > -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null
> \
> > -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null
> > /dev/null
> >
> > test results:
> > 2 encoders  5 encoders 10 encoders
> > Improved   6%   4% 15%
> >
> > For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
> > ./ffmpeg -hwaccel qsv -c:v h264_qsv \
> > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> > -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f
> null /dev/null \
> > -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f
> > null /dev/null
> >
> > test results:
> > 2 scale  5 scale   10 scale
> > Improved   12% 21%21%
> >
> > For CPU only 1 decode to N scaling:
> > ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> > -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
> > -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
> >
> > test results:
> > 2 scale  5 scale   10 scale
> > Improved   25%107%   148%
> >
> > Signed-off-by: Wang, Shaofei 
> > Reviewed-by: Zhao, Jun 
> > ---
> >  fftools/ffmpeg.c| 121
> >  fftools/ffmpeg.h|  14 ++
> >  fftools/ffmpeg_filter.c |   1 +
> >  3 files changed, 128 insertions(+), 8 deletions(-)
> 
> On a bit more review, I don't think this patch works at all.
> 
It has been tested and verified with many cases. More fate cases need to be
covered now.

> The existing code is all written to be run serially.  This simplistic 
> approach to
> parallelising it falls down because many of those functions use variables
> written in what were previously other functions called at different times but
> have now become other threads, introducing undefined behaviour due to
> data races.
>
Actually, this patch does not try to parallelise everything in ffmpeg. It
only threads the input filter of each filter graph (intended for simple
filter graphs), which is a simple way to improve the performance of N filter
graphs without introducing huge modifications. A lot of code still runs
serially; the differences are that each filter graph initialises its own
output streams instead of all being initialised together, and each filter
graph reaps the filters in its own chain.

> To consider a single example (not the only one), the function
> check_init_output_file() does not work at all after this change.  The test for
> OutputStream initialisation (so that you run exactly once after all of the
> output streams are ready) races with other threads setting those variables.
> Since that's undefined behaviour you may get lucky sometimes and have the
> output file initialisation run exactly once, but in general it will fail in 
> unknown
> ways.
> 

The check_init_output_file() function should be responsible only for the
output file associated with the specific output stream managed by each
thread chain; that means even when it is called from different threads, the
data being set is different. Let me double-check.

> If you want to resubmit this patch, you will need to refactor a lot of the 
> other
> code in ffmpeg.c to rule out these undefined cases.
> 
OK. This patch would only affect SIMPLE filter graphs.

> - Mark
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-17 Thread Wang, Shaofei
Thanks. It seems I need to cover the external samples as well.

> -Original Message-
> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of
> Michael Niedermayer
> Sent: Saturday, February 16, 2019 5:22 AM
> To: FFmpeg development discussions and patches 
> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
> decode + N filter graphs and adaptive bitrate.
> 
> On Fri, Feb 15, 2019 at 04:54:23PM -0500, Shaofei Wang wrote:
> > It enables multiple filter graphs to run concurrently, which brings
> > roughly 4%~20% improvement in some 1:N scenarios with CPU or GPU
> > acceleration.
> >
> > Below are some test cases and comparison as reference.
> > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> > (Software: Intel iHD driver - 16.9.00100, CentOS 7)
> 
> breaks fate
> 
> make -j12 fate-filter-overlay-dvdsub-2397 V=2
> 
> 
> frame=  208 fps=0.0 q=-0.0 Lsize=  48kB time=00:00:08.04 bitrate=
> 49.0kbits/s speed=10.6x
> video:105300kB audio:1254kB subtitle:0kB other streams:0kB global
> headers:0kB muxing overhead: unknown
> pthread_join failed with error: No such process
> Aborted (core dumped)
> make: *** [fate-filter-overlay-dvdsub-2397] Error 134
> 
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> Asymptotically faster algorithms should always be preferred if you have
> asymptotical amounts of data


Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-16 Thread Mark Thompson
On 15/02/2019 21:54, Shaofei Wang wrote:
> It enables multiple filter graphs to run concurrently, which brings
> roughly 4%~20% improvement in some 1:N scenarios with CPU or GPU
> acceleration.
> 
> Below are some test cases and comparison as reference.
> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> (Software: Intel iHD driver - 16.9.00100, CentOS 7)
> 
> For 1:N transcode by GPU acceleration with vaapi:
> ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
> -hwaccel_output_format vaapi \
> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
> -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
> 
> test results:
> 2 encoders 5 encoders 10 encoders
> Improved   6.1%6.9%   5.5%
> 
> For 1:N transcode by GPU acceleration with QSV:
> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
> -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null
> 
> test results:
> 2 encoders  5 encoders 10 encoders
> Improved   6%   4% 15%
> 
> For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
> /dev/null \
> -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
> /dev/null
> 
> test results:
> 2 scale  5 scale   10 scale
> Improved   12% 21%21%
> 
> For CPU only 1 decode to N scaling:
> ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
> -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
> 
> test results:
> 2 scale  5 scale   10 scale
> Improved   25%107%   148%
> 
> Signed-off-by: Wang, Shaofei 
> Reviewed-by: Zhao, Jun 
> ---
>  fftools/ffmpeg.c| 121
>  fftools/ffmpeg.h|  14 ++
>  fftools/ffmpeg_filter.c |   1 +
>  3 files changed, 128 insertions(+), 8 deletions(-)

On a bit more review, I don't think this patch works at all.

The existing code is all written to be run serially.  This simplistic approach 
to parallelising it falls down because many of those functions use variables 
written in what were previously other functions called at different times but 
have now become other threads, introducing undefined behaviour due to data 
races.

To consider a single example (not the only one), the function 
check_init_output_file() does not work at all after this change.  The test for 
OutputStream initialisation (so that you run exactly once after all of the 
output streams are ready) races with other threads setting those variables.  
Since that's undefined behaviour you may get lucky sometimes and have the 
output file initialisation run exactly once, but in general it will fail in 
unknown ways.

If you want to resubmit this patch, you will need to refactor a lot of the 
other code in ffmpeg.c to rule out these undefined cases.

- Mark


Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-15 Thread Michael Niedermayer
On Fri, Feb 15, 2019 at 04:54:23PM -0500, Shaofei Wang wrote:
> It enables multiple filter graphs to run concurrently, which brings
> roughly 4%~20% improvement in some 1:N scenarios with CPU or GPU
> acceleration.
> 
> Below are some test cases and comparison as reference.
> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> (Software: Intel iHD driver - 16.9.00100, CentOS 7)

breaks fate

make -j12 fate-filter-overlay-dvdsub-2397 V=2


frame=  208 fps=0.0 q=-0.0 Lsize=  48kB time=00:00:08.04 bitrate=  
49.0kbits/s speed=10.6x
video:105300kB audio:1254kB subtitle:0kB other streams:0kB global headers:0kB 
muxing overhead: unknown
pthread_join failed with error: No such process
Aborted (core dumped)
make: *** [fate-filter-overlay-dvdsub-2397] Error 134

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data




[FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-15 Thread Shaofei Wang
It enables multiple filter graphs to run concurrently, which brings roughly
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration.

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:
2 encoders 5 encoders 10 encoders
Improved   6.1%6.9%   5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
2 scale  5 scale   10 scale
Improved   12% 21%21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:
2 scale  5 scale   10 scale
Improved   25%107%   148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 121 
 fftools/ffmpeg.h|  14 ++
 fftools/ffmpeg_filter.c |   1 +
 3 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..676c783 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -509,6 +509,15 @@ static void ffmpeg_cleanup(int ret)
 }
av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
}
+#if HAVE_THREADS
+fg->inputs[j]->waited_frm = NULL;
+av_frame_free(&fg->inputs[j]->input_frm);
+pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+fg->inputs[j]->t_end = 1;
+pthread_cond_signal(&fg->inputs[j]->process_cond);
+pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+pthread_join(fg->inputs[j]->abr_thread, NULL);
+#endif
av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
av_freep(&fg->inputs[j]->name);
av_freep(&fg->inputs[j]);
@@ -1419,12 +1428,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return  0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter * ifilter)
 {
 AVFrame *filtered_frame = NULL;
 int i;
 
-/* Reap all buffers present in the buffer sinks */
+/* Reap all buffers present in the buffer sinks, or only the buffers
+ * related to the filter graph that has ifilter as its input */
 for (i = 0; i < nb_output_streams; i++) {
 OutputStream *ost = output_streams[i];
 OutputFile*of = output_files[ost->file_index];
@@ -1436,6 +1446,11 @@ static int reap_filters(int flush)
 continue;
 filter = ost->filter->filter;
 
+if (ifilter) {
+if (ifilter != output_streams[i]->filter->graph->inputs[0])
+continue;
+}
+
 if (!ost->initialized) {
 char error[1024] = "";
 ret = init_output_stream(ost, error, sizeof(error));
@@ -2179,7 +2194,8 @@ static int ifilter_send_frame(InputFilter *ifilter, AVFrame *frame)
 }
 }
 
-ret = reap_filters(1);
+ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL);
+
 if (ret < 0 && ret != AVERROR_EOF) {
av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", av_err2str(ret));
 return ret;
@@ -2252,12 +2268,100 @@ static int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacket *pkt)
 return 0;
 }
 
+#if HAVE_THREADS
+static void *filter_pipeline(void *arg)
+{
+InputFilter *fl = arg;
+AVFrame *frm;
+int ret;
+while(1) {
+pthread_mutex_lock(&fl->process_mutex);
+while (fl->waited_frm == NULL && !fl->t_end)
+pthread_cond_wait(&fl->process_cond, &fl->process_mutex);
+pthread_mutex_unlock(&fl->process_mutex);
+
+if (fl->t_end) break;
+
+frm = fl->waited_frm;
+ret = ifilter_send_frame(fl, frm);
+