Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
On 20/02/2019 10:17, Wang, Shaofei wrote: >> -Original Message- >> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of >> Mark Thompson >> Sent: Saturday, February 16, 2019 8:12 PM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 >> decode + N filter graphs and adaptive bitrate. >> On 15/02/2019 21:54, Shaofei Wang wrote: >>> It enabled multiple filter graph concurrency, which bring above about >>> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration >>> >>> Below are some test cases and comparison as reference. >>> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) >>> (Software: Intel iHD driver - 16.9.00100, CentOS 7) >>> >>> For 1:N transcode by GPU acceleration with vaapi: >>> ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \ >>> -hwaccel_output_format vaapi \ >>> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ >>> -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \ >>> -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null >>> >>> test results: >>> 2 encoders 5 encoders 10 encoders >>> Improved 6.1%6.9% 5.5% >>> >>> For 1:N transcode by GPU acceleration with QSV: >>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \ >>> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ >>> -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null >> \ >>> -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null >>> /dev/null >>> >>> test results: >>> 2 encoders 5 encoders 10 encoders >>> Improved 6% 4% 15% >>> >>> For Intel GPU acceleration case, 1 decode to N scaling, by QSV: >>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \ >>> -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ >>> -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f >> null /dev/null \ >>> -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f >>> null /dev/null >>> >>> test results: >>> 2 scale 5 scale 10 scale >>> Improved 12% 21%21% >>> >>> For CPU only 1 decode to N scaling: >>> ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ >>> -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \ >>> -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null >>> >>> test results: >>> 2 scale 5 scale 10 scale >>> Improved 25%107% 148% >>> >>> Signed-off-by: Wang, Shaofei >>> Reviewed-by: Zhao, Jun >>> --- >>> fftools/ffmpeg.c| 121 >> >>> fftools/ffmpeg.h| 14 ++ >>> fftools/ffmpeg_filter.c | 1 + >>> 3 files changed, 128 insertions(+), 8 deletions(-) >> >> On a bit more review, I don't think this patch works at all. >> > It has been tested and verified by a lot of cases. More fate cases need to be > covered now. > >> The existing code is all written to be run serially. This simplistic >> approach to >> parallelising it falls down because many of those functions use variables >> written in what were previously other functions called at different times but >> have now become other threads, introducing undefined behaviour due to >> data races. >> > Actually, this is not a patch to parallel every thing in the ffmpeg. It just > thread the input filter > of the filter graph(tend for simple filter graph), which is a simple way to > improve N filter graph > performance and also without introduce huge modification. So that there is > still a lot of serial function call, differences > are that each filter graph need to init its output stream instead of init all > together and each > filter graph will reap filters for its filter chain. Indeed the existing encapsulation tries to keep things mostly separate, but in various places it accesses shared state which works fine in the serial case but fails when those parts are run in parallel. Data races are undefined behaviour in C; introducing them is not acceptable. >> To consider a single example (not the only one), the function >> check_init_output_file() does not work at all after this change. The test >> for >> OutputStream initialisation (so that you run exactly once after all of the >> output streams are ready) races with other threads setting those variables. >> Since that's undefined behaviour you may get lucky sometimes and have the >> output file initialisation run exactly once, but in general it will fail in >> unknown >> ways. >> > > The check_init_output_file() should be responsible for the output file > related with > specified output stream which managed by each thread chain, that means even it > called by different thread, the data setting are different. Let me double > check. Each output file can contain multiple streams - try a transcode with both audio and video filters. (Incidentally, the patch as-is also crashes in this case if the transcode completes
Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
> -Original Message- > From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of > Mark Thompson > Sent: Saturday, February 16, 2019 8:12 PM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 > decode + N filter graphs and adaptive bitrate. > On 15/02/2019 21:54, Shaofei Wang wrote: > > It enabled multiple filter graph concurrency, which bring above about > > 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration > > > > Below are some test cases and comparison as reference. > > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) > > (Software: Intel iHD driver - 16.9.00100, CentOS 7) > > > > For 1:N transcode by GPU acceleration with vaapi: > > ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \ > > -hwaccel_output_format vaapi \ > > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > > -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \ > > -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null > > > > test results: > > 2 encoders 5 encoders 10 encoders > > Improved 6.1%6.9% 5.5% > > > > For 1:N transcode by GPU acceleration with QSV: > > ./ffmpeg -hwaccel qsv -c:v h264_qsv \ > > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > > -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null > \ > > -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null > > /dev/null > > > > test results: > > 2 encoders 5 encoders 10 encoders > > Improved 6% 4% 15% > > > > For Intel GPU acceleration case, 1 decode to N scaling, by QSV: > > ./ffmpeg -hwaccel qsv -c:v h264_qsv \ > > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > > -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f > null /dev/null \ > > -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f > > null /dev/null > > > > test results: > > 2 scale 5 scale 10 scale > > Improved 12% 21%21% > > > > For CPU only 1 decode to N scaling: > > ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > > -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \ > > -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null > > > > test results: > > 2 scale 5 scale 10 scale > > Improved 25%107% 148% > > > > Signed-off-by: Wang, Shaofei > > Reviewed-by: Zhao, Jun > > --- > > fftools/ffmpeg.c| 121 > > > fftools/ffmpeg.h| 14 ++ > > fftools/ffmpeg_filter.c | 1 + > > 3 files changed, 128 insertions(+), 8 deletions(-) > > On a bit more review, I don't think this patch works at all. > It has been tested and verified by a lot of cases. More fate cases need to be covered now. > The existing code is all written to be run serially. This simplistic > approach to > parallelising it falls down because many of those functions use variables > written in what were previously other functions called at different times but > have now become other threads, introducing undefined behaviour due to > data races. > Actually, this is not a patch to parallel every thing in the ffmpeg. It just thread the input filter of the filter graph(tend for simple filter graph), which is a simple way to improve N filter graph performance and also without introduce huge modification. So that there is still a lot of serial function call, differences are that each filter graph need to init its output stream instead of init all together and each filter graph will reap filters for its filter chain. > To consider a single example (not the only one), the function > check_init_output_file() does not work at all after this change. The test for > OutputStream initialisation (so that you run exactly once after all of the > output streams are ready) races with other threads setting those variables. > Since that's undefined behaviour you may get lucky sometimes and have the > output file initialisation run exactly once, but in general it will fail in > unknown > ways. > The check_init_output_file() should be responsible for the output file related with specified output stream which managed by each thread chain, that means even it called by different thread, the data setting are different. Let me double check. > If you want to resubmit this patch, you will need to refactor a lot of the > other > code in ffmpeg.c to rule out these undefined cases. > OK. This patch would only effect on SIMPLE filter graph. > - Mark > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
Thanks. Seems I need to cover external samples either > -Original Message- > From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of > Michael Niedermayer > Sent: Saturday, February 16, 2019 5:22 AM > To: FFmpeg development discussions and patches > Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 > decode + N filter graphs and adaptive bitrate. > > On Fri, Feb 15, 2019 at 04:54:23PM -0500, Shaofei Wang wrote: > > It enabled multiple filter graph concurrency, which bring above about > > 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration > > > > Below are some test cases and comparison as reference. > > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) > > (Software: Intel iHD driver - 16.9.00100, CentOS 7) > > breaks fate > > make -j12 fate-filter-overlay-dvdsub-2397 V=2 > > > frame= 208 fps=0.0 q=-0.0 Lsize= 48kB time=00:00:08.04 bitrate= > 49.0kbits/s speed=10.6x > video:105300kB audio:1254kB subtitle:0kB other streams:0kB global > headers:0kB muxing overhead: unknown pthread_join failed with error: No > such process Aborted (core dumped) > make: *** [fate-filter-overlay-dvdsub-2397] Error 134 > > [...] > -- > Michael GnuPG fingerprint: > 9FF2128B147EF6730BADF133611EC787040B0FAB > > Asymptotically faster algorithms should always be preferred if you have > asymptotical amounts of data ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
On 15/02/2019 21:54, Shaofei Wang wrote: > It enabled multiple filter graph concurrency, which bring above about > 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration > > Below are some test cases and comparison as reference. > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) > (Software: Intel iHD driver - 16.9.00100, CentOS 7) > > For 1:N transcode by GPU acceleration with vaapi: > ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \ > -hwaccel_output_format vaapi \ > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \ > -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null > > test results: > 2 encoders 5 encoders 10 encoders > Improved 6.1%6.9% 5.5% > > For 1:N transcode by GPU acceleration with QSV: > ./ffmpeg -hwaccel qsv -c:v h264_qsv \ > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \ > -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null > > test results: > 2 encoders 5 encoders 10 encoders > Improved 6% 4% 15% > > For Intel GPU acceleration case, 1 decode to N scaling, by QSV: > ./ffmpeg -hwaccel qsv -c:v h264_qsv \ > -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null > /dev/null \ > -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null > /dev/null > > test results: > 2 scale 5 scale 10 scale > Improved 12% 21%21% > > For CPU only 1 decode to N scaling: > ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ > -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \ > -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null > > test results: > 2 scale 5 scale 10 scale > Improved 25%107% 148% > > Signed-off-by: Wang, Shaofei > Reviewed-by: Zhao, Jun > --- > fftools/ffmpeg.c| 121 > > fftools/ffmpeg.h| 14 ++ > fftools/ffmpeg_filter.c | 1 + > 3 files changed, 128 insertions(+), 8 deletions(-) On a bit more review, I don't think this patch works at all. The existing code is all written to be run serially. This simplistic approach to parallelising it falls down because many of those functions use variables written in what were previously other functions called at different times but have now become other threads, introducing undefined behaviour due to data races. To consider a single example (not the only one), the function check_init_output_file() does not work at all after this change. The test for OutputStream initialisation (so that you run exactly once after all of the output streams are ready) races with other threads setting those variables. Since that's undefined behaviour you may get lucky sometimes and have the output file initialisation run exactly once, but in general it will fail in unknown ways. If you want to resubmit this patch, you will need to refactor a lot of the other code in ffmpeg.c to rule out these undefined cases. - Mark ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
On Fri, Feb 15, 2019 at 04:54:23PM -0500, Shaofei Wang wrote: > It enabled multiple filter graph concurrency, which bring above about > 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration > > Below are some test cases and comparison as reference. > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) > (Software: Intel iHD driver - 16.9.00100, CentOS 7) breaks fate make -j12 fate-filter-overlay-dvdsub-2397 V=2 frame= 208 fps=0.0 q=-0.0 Lsize= 48kB time=00:00:08.04 bitrate= 49.0kbits/s speed=10.6x video:105300kB audio:1254kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown pthread_join failed with error: No such process Aborted (core dumped) make: *** [fate-filter-overlay-dvdsub-2397] Error 134 [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB Asymptotically faster algorithms should always be preferred if you have asymptotical amounts of data signature.asc Description: PGP signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
It enabled multiple filter graph concurrency, which bring above about 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration Below are some test cases and comparison as reference. (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) (Software: Intel iHD driver - 16.9.00100, CentOS 7) For 1:N transcode by GPU acceleration with vaapi: ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \ -hwaccel_output_format vaapi \ -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \ -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null test results: 2 encoders 5 encoders 10 encoders Improved 6.1%6.9% 5.5% For 1:N transcode by GPU acceleration with QSV: ./ffmpeg -hwaccel qsv -c:v h264_qsv \ -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \ -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null test results: 2 encoders 5 encoders 10 encoders Improved 6% 4% 15% For Intel GPU acceleration case, 1 decode to N scaling, by QSV: ./ffmpeg -hwaccel qsv -c:v h264_qsv \ -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \ -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null test results: 2 scale 5 scale 10 scale Improved 12% 21%21% For CPU only 1 decode to N scaling: ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \ -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \ -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null test results: 2 scale 5 scale 10 scale Improved 25%107% 148% Signed-off-by: Wang, Shaofei Reviewed-by: Zhao, Jun --- fftools/ffmpeg.c| 121 fftools/ffmpeg.h| 14 ++ fftools/ffmpeg_filter.c | 1 + 3 files changed, 128 insertions(+), 8 deletions(-) diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c index 544f1a1..676c783 100644 --- a/fftools/ffmpeg.c +++ b/fftools/ffmpeg.c @@ -509,6 +509,15 @@ static void ffmpeg_cleanup(int ret) } av_fifo_freep(>inputs[j]->ist->sub2video.sub_queue); } +#if HAVE_THREADS +fg->inputs[j]->waited_frm = NULL; +av_frame_free(>inputs[j]->input_frm); +pthread_mutex_lock(>inputs[j]->process_mutex); +fg->inputs[j]->t_end = 1; +pthread_cond_signal(>inputs[j]->process_cond); +pthread_mutex_unlock(>inputs[j]->process_mutex); +pthread_join(fg->inputs[j]->abr_thread, NULL); +#endif av_buffer_unref(>inputs[j]->hw_frames_ctx); av_freep(>inputs[j]->name); av_freep(>inputs[j]); @@ -1419,12 +1428,13 @@ static void finish_output_stream(OutputStream *ost) * * @return 0 for success, <0 for severe errors */ -static int reap_filters(int flush) +static int reap_filters(int flush, InputFilter * ifilter) { AVFrame *filtered_frame = NULL; int i; -/* Reap all buffers present in the buffer sinks */ +/* Reap all buffers present in the buffer sinks or just reap specified + * buffer which related with the filter graph who got ifilter as input*/ for (i = 0; i < nb_output_streams; i++) { OutputStream *ost = output_streams[i]; OutputFile*of = output_files[ost->file_index]; @@ -1436,6 +1446,11 @@ static int reap_filters(int flush) continue; filter = ost->filter->filter; +if (ifilter) { +if (ifilter != output_streams[i]->filter->graph->inputs[0]) +continue; +} + if (!ost->initialized) { char error[1024] = ""; ret = init_output_stream(ost, error, sizeof(error)); @@ -2179,7 +2194,8 @@ static int ifilter_send_frame(InputFilter *ifilter, AVFrame *frame) } } -ret = reap_filters(1); +ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL); + if (ret < 0 && ret != AVERROR_EOF) { av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", av_err2str(ret)); return ret; @@ -2252,12 +2268,100 @@ static int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacke return 0; } +#if HAVE_THREADS +static void *filter_pipeline(void *arg) +{ +InputFilter *fl = arg; +AVFrame *frm; +int ret; +while(1) { +pthread_mutex_lock(>process_mutex); +while (fl->waited_frm == NULL && !fl->t_end) +pthread_cond_wait(>process_cond, >process_mutex); +pthread_mutex_unlock(>process_mutex); + +if (fl->t_end) break; + +frm = fl->waited_frm; +ret = ifilter_send_frame(fl, frm); +