On 11/04/17 08:30, Jun Zhao wrote:
> From 9bab458006369f427fa2f4c6248ee89329e81067 Mon Sep 17 00:00:00 2001
> From: Jun Zhao <jun.z...@intel.com>
> Date: Tue, 11 Apr 2017 14:37:07 +0800
> Subject: [PATCH] hwcontext_vaapi: use the special UC copy for downloading
>  frames.
> 
> used SSE4 UC function for copying image data from GPU mapped memory,
> see 
> https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
> 
> before this change, VA-API HWAccel decoder copy image data from GPU
> mapped memory used vaCreateImage/vaGetImage/av_frame_copy, now use
> vaDeriveImage/av_image_copy_uc_from.
> 
> decoding a 3000 frames 1080p h264 stream in Intel(R) Core(TM)
> i5-6260U CPU @ 1.80GHz, the CPU usage and decode fps as follow:
> 
> 1. Software decoder.
> ./ffmpeg -i ./skyfall2-trailer.mp4 -f null /dev/null
> 
> CPU: 80%, fps: 334fps
> 
> 2a. vaCreateImage/vaGetImage/av_frame_copy
> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i 
> skyfall2-trailer.mp4 -f null /dev/null
> 
> CPU: 12%, fps: 147fps
> 
> 2b. vaDeriveImage/av_image_copy_uc_from
> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i 
> skyfall2-trailer.mp4 -f null /dev/null
> 
> CPU: 23%, fps: 628fps
> 
> Signed-off-by: Jun Zhao <jun.z...@intel.com>
> ---

This change was considered in libav when the UC copy function was introduced 
(<https://lists.libav.org/pipermail/libav-devel/2016-August/078826.html>, 
<https://lists.libav.org/pipermail/libav-devel/2016-August/078825.html>), but 
was not in the end applied.

The reasons for this were:

* It had much worse performance on the low-power cores - try your benchmark 
above on Braswell.

* The performance gain on high-power cores was marginal - we had results of 
maybe +20% performance in the best case with +40% or more CPU time used.  I 
admit this seems at odds with your results above, I'll try running it again 
later with current ffmpeg.

* Most software uses don't actually want the native surface format (typically 
NV12), so you end up with an extra convert step in software anyway.  VAAPI 
supports downloading directly to YUV420P - try with -hwaccel_output_format 
yuv420p or -vf hwmap=read,format=yuv420p.

(Note that none of these arguments applied to DXVA2 because it has no 
GPU-driven download, so indeed it was applied there (and is why the streaming 
copy code exists in libavutil at all).)

Still, performance results on more cores would be welcome for comparison, 
particularly newer ones - can you test on a Gemini Lake or Apollo Lake?


Some patch comments as well:

>  libavutil/hwcontext_vaapi.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/libavutil/hwcontext_vaapi.c b/libavutil/hwcontext_vaapi.c
> index 3b50e95..23899f1 100644
> --- a/libavutil/hwcontext_vaapi.c
> +++ b/libavutil/hwcontext_vaapi.c
> @@ -40,6 +40,7 @@
>  #include "mem.h"
>  #include "pixdesc.h"
>  #include "pixfmt.h"
> +#include "imgutils.h"
>  
>  typedef struct VAAPIDevicePriv {
>  #if HAVE_VAAPI_X11
> @@ -720,7 +721,7 @@ static int vaapi_map_frame(AVHWFramesContext *hwfc,
>      // assume for now that the user is not aware of that and would therefore
>      // prefer not to be given direct-mapped memory if they request read 
> access.
>      if (ctx->derive_works && dst->format == hwfc->sw_format &&
> -        ((flags & AV_HWFRAME_MAP_DIRECT) || !(flags & AV_HWFRAME_MAP_READ))) 
> {
> +        ((flags & AV_HWFRAME_MAP_DIRECT) || (flags & AV_HWFRAME_MAP_READ))) {

This disables direct-mapping by default for write (which is always correct on 
current platforms, as far as I know?).

More generally, I don't think you should be modifying this function.  It also 
gets used for av_hwframe_map(), and defaulting to giving users uncached memory 
for reading when they aren't expecting it is a complete disaster for 
performance.

>          vas = vaDeriveImage(hwctx->display, surface_id, &map->image);
>          if (vas != VA_STATUS_SUCCESS) {
>              av_log(hwfc, AV_LOG_ERROR, "Failed to derive image from "
> @@ -736,7 +737,6 @@ static int vaapi_map_frame(AVHWFramesContext *hwfc,
>              err = AVERROR(EIO);
>              goto fail;
>          }
> -        map->flags |= AV_HWFRAME_MAP_DIRECT;
>      } else {
>          vas = vaCreateImage(hwctx->display, image_format,
>                              hwfc->width, hwfc->height, &map->image);
> @@ -806,7 +806,8 @@ static int vaapi_transfer_data_from(AVHWFramesContext 
> *hwfc,
>                                      AVFrame *dst, const AVFrame *src)
>  {
>      AVFrame *map;
> -    int err;
> +    int i,err;
> +    ptrdiff_t src_linesize[4], dst_linesize[4];
>  
>      if (dst->width > hwfc->width || dst->height > hwfc->height)
>          return AVERROR(EINVAL);
> @@ -823,11 +824,12 @@ static int vaapi_transfer_data_from(AVHWFramesContext 
> *hwfc,
>      map->width  = dst->width;
>      map->height = dst->height;
>  
> -    err = av_frame_copy(dst, map);
> -    if (err)
> -        goto fail;
> -
> -    err = 0;
> +    for (i = 0; i < 4; i++) {
> +        dst_linesize[i] = dst->linesize[i];
> +        src_linesize[i] = map->linesize[i];
> +    }
> +    av_image_copy_uc_from(dst->data, dst_linesize, map->data, src_linesize,
> +                          hwfc->sw_format, src->width, src->height);

This isn't necessarily sw_format, because the formats don't have to match in 
the download (the NV12 -> YUV420P case).

Also, I think you want to only use the streaming copy when the source is 
uncached.  (I assume glibc memcpy() will always win if both are normal memory, 
though I admit I've never tested that.)

>  fail:
>      av_frame_free(&map);
>      return err;
> -- 
> 2.9.3

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Reply via email to