On 25.02.2025 06:36, Scott Theisen wrote:
On 2/22/25 08:16, Timo Rothenpieler wrote:
On 22.02.2025 03:52, Scott Theisen wrote:
On 2/21/25 08:26, Timo Rothenpieler wrote:
On 20.02.2025 21:37, Scott Theisen wrote:
The default value of CuvidContext::nb_surfaces was reduced from 25 to 5 (i.e. CUVID_MAX_DISPLAY_DELAY + 1) in 402d98c9d467dff6931d906ebb732b9a00334e0b.

In cuvid_is_buffer_full(), delay can be 2 * CUVID_MAX_DISPLAY_DELAY with double-rate deinterlacing. ctx->nb_surfaces is CUVID_DEFAULT_NUM_SURFACES = (CUVID_MAX_DISPLAY_DELAY + 1) by default, in which case cuvid_is_buffer_full() will always return true and cuvid_output_frame() will never read any data, since it will not call ff_decode_get_packet().
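
To make the failure concrete: with the defaults (CUVID_MAX_DISPLAY_DELAY is 4, hence the default of 5 surfaces), the check works out to:
```
delay       = 2 * CUVID_MAX_DISPLAY_DELAY = 8   (double-rate deinterlacing)
nb_surfaces = CUVID_DEFAULT_NUM_SURFACES  = 5   (CUVID_MAX_DISPLAY_DELAY + 1)

av_fifo_can_read(ctx->frame_queue) + delay >= nb_surfaces
                                 0 + 8     >= 5   -> true even with an empty queue
```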

It's been way too long since I looked at all that code, and I didn't even write most of the code involved:
https://github.com/FFmpeg/FFmpeg/commit/bddb2343b6e594e312dadb5d21b408702929ae04
https://github.com/FFmpeg/FFmpeg/commit/402d98c9d467dff6931d906ebb732b9a00334e0b

But doesn't this instead mean that the logic in cuvid_is_buffer_full is flawed somehow?

I think it is checking that the number of frames ready to send to the driver + the number of frames queued in the driver >= the number of decoded frame buffers.  However, it doesn't actually know how many frames are queued in the driver and assumes the maximum.

Not sure if I understand you right, but the way it works is that av_fifo_can_read(ctx->frame_queue) returns how many frames have already been returned from cuvid and are ready for cuviddec.c to return.

To that number, the maximum possible number of delayed frames is added, which could be returned by the decoder without feeding in any more input frames.

If that number reaches the desired amount of surfaces to buffer, cuvid_is_buffer_full() will report that its buffer is full, and cuviddec.c will stop fetching new input.
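
For reference, the check in question looks roughly like this (reconstructed from the unmodified context lines of the patch further down in this thread, so treat it as a sketch rather than a verbatim copy of the tree):
```
static int cuvid_is_buffer_full(AVCodecContext *avctx)
{
    CuvidContext *ctx = avctx->priv_data;

    // worst case of frames cuvid may still emit without any further input
    int delay = ctx->cuparseinfo.ulMaxDisplayDelay;
    if (ctx->deint_mode != cudaVideoDeinterlaceMode_Weave && !ctx->drop_second_field)
        delay *= 2; // double-rate deinterlacing: two output frames per surface

    // frames already queued plus everything that might still show up
    return av_fifo_can_read(ctx->frame_queue) + delay >= ctx->nb_surfaces;
}
```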

I think the point I was trying to make is that, since it doesn't track how many frames have been sent to and received from the driver, it always assumes the maximum number of delayed frames is available in the driver, which is obviously not correct when no frames have been sent to the driver yet.


Just increasing the default number of surfaces does not seem like the correct or sensible fix, since it will increase VRAM usage, potentially by quite a bit, for all users.


The changes to cuvid_handle_video_sequence() from 402d98c9d467dff6931d906ebb732b9a00334e0b will increase nb_surfaces once data has been read.

Only if the decoder reports that it will potentially buffer even more frames.

From looking at this a bit, the issue will only happen when deinterlacing: the logic in cuvid_is_buffer_full then becomes stuck and will always claim the buffer is full. And from my understanding, it's correct in making that claim. Due to the display delay, it could in theory happen that the moment cuvid starts outputting frames, there is more output available than what fits into ctx->frame_queue, since it delayed by 4 frames, which results in 8 surfaces, but the queue only fits 5.

So to me it looks like that the correct fix would be to double the size of the frame_queue when deinterlacing, not unconditionally.

There is nothing stopping deint_mode or drop_second_field from being changed after cuvid_decode_init() is called, so it doesn't necessarily know it will deinterlace.

Regardless, 402d98c9d467dff6931d906ebb732b9a00334e0b reduced CUVID_DEFAULT_NUM_SURFACES from 25 to *only 5* to not break playback entirely.  I don't think the intention was to break playback for double rate deinterlacing while allowing playback for only single rate deinterlacing.

Also, if AV_CODEC_FLAG_LOW_DELAY is set, then only one output surface is needed, but there are still 5.

The structs stored in the ctx->frame_queue aren't what's using all the memory. It's the frames buffered by cuvid itself, which are referred to by that buffer, so having it be larger than what cuvid will actually buffer doesn't hurt all that much.
But yeah, it could be shrunk in this case.

What this whole dance is actually trying to accomplish is to ensure that the number of "ready but not-yet-returned frames" never exceeds the maximum possible value for cuvid, set via ulNumDecodeSurfaces/ulMaxNumDecodeSurfaces, which is determined and stored in ctx->nb_surfaces during cuvid_handle_video_sequence().

In the default mode of operation, the buffer_full indicator will indeed stop pulling new input the moment even one frame is returned. But that's fine, since at that point a bunch of input has already been consumed and a decent delay has been built up. In low-delay mode, frames are pretty much returned the moment it's possible.

So looking at all this, I still think the core of the issue is incorrect handling of deinterlacing in all this. CUVID treats a deinterlaced frame as one internal frame, but it's stored in the frame_queue as two frames.

So in the case of deinterlacing without drop_second_field, the size of that queue needs to be doubled, but nb_surfaces must stay the same, since for cuvid itself it's still just one frame. And in turn the is_buffer_full function has to be adjusted to multiply nb_surfaces by two if deinterlacing and not drop_second_field.

So you think something like this would work?:
```
diff --git a/mythtv/external/FFmpeg/libavcodec/cuviddec.c b/mythtv/external/FFmpeg/libavcodec/cuviddec.c
index 81ac54297e..535ff7afb5 100644
--- a/mythtv/external/FFmpeg/libavcodec/cuviddec.c
+++ b/mythtv/external/FFmpeg/libavcodec/cuviddec.c
@@ -317,13 +317,13 @@ static int CUDAAPI cuvid_handle_video_sequence(void *opaque, CUVIDEOFORMAT* form
          return 0;
      }

-    fifo_size_inc = ctx->nb_surfaces;
+    fifo_size_inc = av_fifo_can_read(ctx->frame_queue) + av_fifo_can_write(ctx->frame_queue);
     ctx->nb_surfaces = FFMAX(ctx->nb_surfaces, format->min_num_decode_surfaces + 3);

      if (avctx->extra_hw_frames > 0)
          ctx->nb_surfaces += avctx->extra_hw_frames;

-    fifo_size_inc = ctx->nb_surfaces - fifo_size_inc;
+    fifo_size_inc = (ctx->nb_surfaces * 2) - fifo_size_inc; // copy * 2 logic from cuvid_is_buffer_full()?
     if (fifo_size_inc > 0 && av_fifo_grow2(ctx->frame_queue, fifo_size_inc) < 0) {
         av_log(avctx, AV_LOG_ERROR, "Failed to grow frame queue on video sequence callback\n");
          ctx->internal_error = AVERROR(ENOMEM);
@@ -417,10 +417,11 @@ static int cuvid_is_buffer_full(AVCodecContext *avctx)
      CuvidContext *ctx = avctx->priv_data;

      int delay = ctx->cuparseinfo.ulMaxDisplayDelay;
+    int output_frames = 1;
     if (ctx->deint_mode != cudaVideoDeinterlaceMode_Weave && !ctx->drop_second_field)
-        delay *= 2;
+        output_frames = 2;

-    return av_fifo_can_read(ctx->frame_queue) + delay >= ctx->nb_surfaces;
+    return av_fifo_can_read(ctx->frame_queue) + (delay * output_frames) >= (ctx->nb_surfaces * output_frames); // should this be >= av_fifo_can_read(ctx->frame_queue) + av_fifo_can_write(ctx->frame_queue) ?
  }

 static int cuvid_decode_packet(AVCodecContext *avctx, const AVPacket *avpkt)
@@ -899,7 +900,7 @@ static av_cold int cuvid_decode_init(AVCodecContext *avctx)
      if(ctx->nb_surfaces < 0)
          ctx->nb_surfaces = CUVID_DEFAULT_NUM_SURFACES;

-    ctx->frame_queue = av_fifo_alloc2(ctx->nb_surfaces, sizeof(CuvidParsedFrame), 0);
+    ctx->frame_queue = av_fifo_alloc2(ctx->nb_surfaces * 2, sizeof(CuvidParsedFrame), 0);
      if (!ctx->frame_queue) {
          ret = AVERROR(ENOMEM);
          goto error;
```

Something like that, I'm looking into it as well.
Been too long since I looked at that code.

It is odd that there is no function to return AVFifo::nb_elems directly.  I think ctx->nb_surfaces was incorrectly used instead of AVFifo::nb_elems.

Does frame_queue even need to grow, since ulMaxDisplayDelay is fixed at 4 (or 0)?

The frame queue contains up to ulMaxNumDecodeSurfaces surfaces, not up to ulMaxDisplayDelay. ulMaxDisplayDelay indicates how many more frames might be in flight and spontaneously dumped out all at once by cuvid on the next input packet, so the queue needs to be kept free enough to contain them all.

So: The frame_queue needs to be able to contain up to ulMaxNumDecodeSurfaces frames. When deinterlacing and not dropping the second field, each DecodeSurface produces two frames, so the queue needs to double in size then.
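
In pseudo-code, the sizing rule I mean is roughly this (hypothetical names, just to illustrate, not a patch):
```
/* illustration only, not actual patch code */
int frames_per_surface = (ctx->deint_mode != cudaVideoDeinterlaceMode_Weave &&
                          !ctx->drop_second_field) ? 2 : 1;
/* the fifo must be able to hold the output of every decode surface at once */
int needed_fifo_capacity = ctx->nb_surfaces * frames_per_surface;
```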


Changing that stuff at runtime is no problem, since to change anything, cuvid_handle_video_sequence() has to re-run, which updates all these sizes and will resize the fifo accordingly. And when turning it off, the only thing that happens is that buffer_full will report it's full immediately, and some frames need to be read out before accepting input again.

There's also the edge case of "half a frame" having already been returned, so the queue is potentially no longer considered full, but all decode surfaces are still in use, since the other half of that deinterlaced frame is still in the queue.
So special care must be taken to not report the buffer as free too early.


I'm not understanding you here.  Aren't the elements in frame_queue already extracted from the driver and owned by FFmpeg?

No, the ctx->frame_queue frames are still cuvid frames, and they stay that way until cuvid_output_frame() removes them from the queue and downloads them into FFmpeg-owned memory.

Thus the number of distinct frames in ctx->frame_queue must never exceed ulMaxNumDecodeSurfaces. And new packets can only be fed to cuvid while the frame_queue still has space to accept up to ulMaxDisplayDelay (*2 when deinterlacing) new frames, which might spontaneously come out of cuvid at any moment.
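
As a rough sketch of that invariant (again just an illustration with made-up variable names, not code from the tree):
```
/* illustration only: when may another packet be fed to cuvid? */
int per_surface = (ctx->deint_mode != cudaVideoDeinterlaceMode_Weave &&
                   !ctx->drop_second_field) ? 2 : 1;
/* worst case of frames that might spontaneously come out on the next packet */
int pending = ctx->cuparseinfo.ulMaxDisplayDelay * per_surface;
/* only accept new input while the queue can still absorb all of them */
int can_accept_input = av_fifo_can_write(ctx->frame_queue) >= pending;
```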