Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
On 08.05.2018 10:11, Oscar Amoros Huguet wrote:
> Thank you so much!
>
> We will test this hopefully today, and verify the expected behavior with NSIGHT.
>
> By the way, I'm new to ffmpeg, so... I don't know if you use your fork to test things first, before adding the changes to the main ffmpeg project? Or should we consider compiling and using your fork?
>
> Thanks!

I'm only using my github fork to stage new patches. Most of the time it's notably behind ffmpeg master or contains half-broken WIP patches, so I wouldn't recommend using it on a regular basis.

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Thank you so much!

We will test this hopefully today, and verify the expected behavior with NSIGHT.

By the way, I'm new to ffmpeg, so... I don't know if you use your fork to test things first, before adding the changes to the main ffmpeg project? Or should we consider compiling and using your fork?

Thanks!

-----Original Message-----
From: ffmpeg-devel On Behalf Of Timo Rothenpieler
Sent: Tuesday, May 8, 2018 12:11 AM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Am 07.05.2018 um 19:37 schrieb Oscar Amoros Huguet:
> I was looking at the NVIDIA Video Codec SDK samples (https://developer.nvidia.com/nvidia-video-codec-sdk#Download), where you can find the header NvDecoder.h next to cuviddec.h, where CUVIDPROCPARAMS is defined.
>
> Anyway, I should have looked at the ffmpeg code directly to see what's being used, sorry for that.
>
> Great then! Having the same CUDA stream (either default or custom) for everything is the safest and most efficient way to go (in this case).
>
> Thanks, and let me know if I can help with anything.

You can find the mentioned changes on my personal github: https://github.com/BtbN/FFmpeg

The cuMemcpy is entirely gone, not just made async. And the CUstream can be set and will be passed on to cuvid/nvdec.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Am 07.05.2018 um 19:37 schrieb Oscar Amoros Huguet:
> I was looking at the NVIDIA Video Codec SDK samples (https://developer.nvidia.com/nvidia-video-codec-sdk#Download), where you can find the header NvDecoder.h next to cuviddec.h, where CUVIDPROCPARAMS is defined.
>
> Anyway, I should have looked at the ffmpeg code directly to see what's being used, sorry for that.
>
> Great then! Having the same CUDA stream (either default or custom) for everything is the safest and most efficient way to go (in this case).
>
> Thanks, and let me know if I can help with anything.

You can find the mentioned changes on my personal github: https://github.com/BtbN/FFmpeg

The cuMemcpy is entirely gone, not just made async. And the CUstream can be set and will be passed on to cuvid/nvdec.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Thanks for the tip on the push/pop solution (a custom version of the ffnvcodec headers). It works for us; we may do as you say. Thanks again.

Oscar

-----Original Message-----
From: ffmpeg-devel On Behalf Of Timo Rothenpieler
Sent: Monday, May 7, 2018 1:25 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

On 26.04.2018 18:03, Oscar Amoros Huguet wrote:
> Thanks Mark,
>
> You are right, we can implement in our code a sort of "av_hwdevice_ctx_set" (which does not exist) by using av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init(). We actually use av_hwdevice_ctx_alloc in our code already for the feature we implemented. We are not sure about the license implications though; we link dynamically to comply with the LGPL. I guess both calls are public, since they are not in "internal"-labelled files.
>
> We are perfectly OK with using av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init() outside ffmpeg to use our own CUDA context. By doing so, in the current ffmpeg code, the internal flag "AVCUDADeviceContextInternal.is_allocated" is not set to 1, and therefore the CUDA context is not destroyed by ffmpeg in "cuda_device_uninit", which is the desired behavior.
>
> In fact, this flag implies that the context was not allocated by ffmpeg. Maybe this is the right flag to use to avoid push/pop pairs when the CUDA context is not created by ffmpeg. What do you think?
>
> We can adapt all of the push/pop pairs in the code to follow this policy, whichever flag is used.
>
> About the performance effects of these push/pop calls: we have seen with NVIDIA profiling tools (the NSIGHT plugin for Visual Studio) that the CUDA runtime detects that the context you want to set is the same as the one currently set, so the push call does nothing and takes 0.0016 ms on average (CPU time). But for some reason the cuCtxPopCurrent call does take more time, using 0.02 ms of CPU time per call. That is 0.16 ms total per frame when decoding 8 feeds. This is small, but it's easy to remove.

I'm not a fan of touching every single bit of CUDA-related code for this. Push/Pop, especially for the context that's already active, should be free. If it's not, that's something I'd complain to nvidia about.

For your specific use case, you could build FFmpeg with a custom version of the ffnvcodec headers in which the push/pop ctx functions are practically no-ops.

> Additionally, could you give your opinion on the feature we also may want to add in the future, that we mentioned in the previous email? Basically, we may want to add one more CUDA function, specifically cuMemcpy2DAsync, and the possibility to set a CUstream in AVCUDADeviceContext, so it is used with cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU time) per frame when decoding 8 Full HD frames, and up to 0.5 ms (GPU time) per frame when decoding two 4K frames. This may sound like too little, but for us it is significant. Our software needs to do many things in a maximum of 33 ms with CUDA on the GPU per frame, and we have little GPU time left.

This is interesting and I'm considering making that the default, as it would fit well with the current infrastructure, delaying the sync call to the moment the frame leaves avcodec, which with the internal re-ordering and delay should give plenty of time for the copy to finish.
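[Editor's sketch] The custom-header workaround mentioned above could look roughly like this. The typedefs mirror the CUDA driver API so the snippet compiles stand-alone; the function names are made up for illustration, and wiring such stubs into the ffnvcodec dynamic loader is left out.

```c
/* Sketch: no-op replacements for cuCtxPushCurrent/cuCtxPopCurrent, for an
 * application that creates exactly one CUDA context and keeps it current.
 * The typedefs mirror the CUDA driver API so this compiles stand-alone;
 * a real build would patch the ffnvcodec headers instead. */
typedef struct CUctx_st *CUcontext;
typedef int CUresult;
enum { CUDA_SUCCESS = 0 };

/* The single context the application made current at startup. */
static CUcontext app_ctx;

static CUresult noop_ctx_push(CUcontext ctx)
{
    (void)ctx;               /* already current, nothing to do */
    return CUDA_SUCCESS;
}

static CUresult noop_ctx_pop(CUcontext *ctx)
{
    if (ctx)
        *ctx = app_ctx;      /* report the still-current context... */
    return CUDA_SUCCESS;     /* ...without actually popping it */
}
```

With stubs like these, FFmpeg's internal push/pop pairs collapse to two trivial calls, which is effectively what the profiler numbers above suggest avoiding.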
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Hi!

Even if there is a need for a synchronization before leaving the ffmpeg call, calling cuMemcpyAsync will allow the copies to overlap with any other task on the GPU that was enqueued using any other non-blocking CUDA stream. That's exactly what we want to achieve. This would automatically benefit any other app that uses non-blocking CUDA streams as independent CUDA workflows.

Oscar

Sent from my iPhone

On 7 May 2018, at 13:54, Timo Rothenpieler wrote:
>>> Additionally, could you give your opinion on the feature we also may want to add in the future, that we mentioned in the previous email? Basically, we may want to add one more CUDA function, specifically cuMemcpy2DAsync, and the possibility to set a CUstream in AVCUDADeviceContext, so it is used with cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU time) per frame when decoding 8 Full HD frames, and up to 0.5 ms (GPU time) per frame when decoding two 4K frames. This may sound like too little, but for us it is significant. Our software needs to do many things in a maximum of 33 ms with CUDA on the GPU per frame, and we have little GPU time left.
>>
>> This is interesting and I'm considering making that the default, as it would fit well with the current infrastructure, delaying the sync call to the moment the frame leaves avcodec, which with the internal re-ordering and delay should give plenty of time for the copy to finish.
>
> I'm not sure if/how well this works with the mapped cuvid frames though. The frame would already be unmapped and potentially re-used again before the async copy completes. So it would need an immediate call to Sync right after the 3 async copy calls, making the entire effort pointless.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
I was looking at the NVIDIA Video Codec SDK samples (https://developer.nvidia.com/nvidia-video-codec-sdk#Download), where you can find the header NvDecoder.h next to cuviddec.h, where CUVIDPROCPARAMS is defined.

Anyway, I should have looked at the ffmpeg code directly to see what's being used, sorry for that.

Great then! Having the same CUDA stream (either default or custom) for everything is the safest and most efficient way to go (in this case).

Thanks, and let me know if I can help with anything.

Oscar

> On 7 May 2018, at 18:43, Timo Rothenpieler wrote:
>
> Am 07.05.2018 um 18:25 schrieb Oscar Amoros Huguet:
>> Have a look at this, looks pretty interesting:
>> /**
>>  * @brief This function decodes a frame and returns the locked frame buffers
>>  * This makes the buffers available for use by the application without the buffers
>>  * getting overwritten, even if subsequent decode calls are made. The frame buffers
>>  * remain locked, until ::UnlockFrame() is called
>>  */
>> bool DecodeLockFrame(const uint8_t *pData, int nSize, uint8_t ***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = NULL, int64_t timestamp = 0, CUstream stream = 0);
>> Oscar
>
> I'm not sure what API docs you are referring to here. Google has never seen them either.
>
> But CUVIDPROCPARAMS, which is passed to cuvidMapVideoFrame, does indeed have
> CUstream output_stream; /**< IN: stream object used by cuvidMapVideoFrame */
> so setting the stream there would be easily possible.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Am 07.05.2018 um 18:25 schrieb Oscar Amoros Huguet:
> Have a look at this, looks pretty interesting:
> /**
>  * @brief This function decodes a frame and returns the locked frame buffers
>  * This makes the buffers available for use by the application without the buffers
>  * getting overwritten, even if subsequent decode calls are made. The frame buffers
>  * remain locked, until ::UnlockFrame() is called
>  */
> bool DecodeLockFrame(const uint8_t *pData, int nSize, uint8_t ***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = NULL, int64_t timestamp = 0, CUstream stream = 0);
> Oscar

I'm not sure what API docs you are referring to here. Google has never seen them either.

But CUVIDPROCPARAMS, which is passed to cuvidMapVideoFrame, does indeed have

CUstream output_stream; /**< IN: stream object used by cuvidMapVideoFrame */

so setting the stream there would be easily possible.
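[Editor's sketch] Routing the map operation onto a caller-supplied stream would then be a one-field change at map time. The struct below is cut down to the fields discussed here (the real CUVIDPROCPARAMS lives in cuviddec.h and has many more members), and the helper name is made up for illustration:

```c
/* Reduced stand-in for CUVIDPROCPARAMS; the real definition is in
 * cuviddec.h. Only the fields relevant to this discussion are kept. */
typedef struct CUstream_st *CUstream;

typedef struct {
    int      progressive_frame;
    CUstream output_stream;   /* IN: stream object used by cuvidMapVideoFrame */
} ProcParamsSketch;

/* Hypothetical helper: fill the params before calling cuvidMapVideoFrame(). */
static void prepare_map_params(ProcParamsSketch *p, int progressive, CUstream stream)
{
    p->progressive_frame = progressive;
    p->output_stream     = stream;   /* 0 keeps today's default-stream behavior */
}
```

Leaving output_stream at 0 preserves the current behavior, so an application that never sets a stream sees no change.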
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Have a look at this, looks pretty interesting:

/**
 * @brief This function decodes a frame and returns the locked frame buffers
 * This makes the buffers available for use by the application without the buffers
 * getting overwritten, even if subsequent decode calls are made. The frame buffers
 * remain locked, until ::UnlockFrame() is called
 */
bool DecodeLockFrame(const uint8_t *pData, int nSize, uint8_t ***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = NULL, int64_t timestamp = 0, CUstream stream = 0);

Oscar

-----Original Message-----
From: ffmpeg-devel On Behalf Of Oscar Amoros Huguet
Sent: Monday, May 7, 2018 6:21 PM
To: FFmpeg development discussions and patches
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Removing the need for the memcpy itself would clearly be the best. Looking at NSIGHT, I see that NVDEC internally calls a color space transformation kernel on the default stream, and does not synchronize with the calling CPU thread. The cuMemcpy calls you have right now use the same default stream and do block the calling CPU thread, so they perform an implicit synchronization with the CPU thread.

This means that if you remove the memcpys, and the user wants to make any CUDA call over the results of this kernel, to do it safely they have two options:

1. Either they use the same default stream (which is what I'm trying to avoid here).
2. Or the NvDecoder call "bool Decode(const uint8_t *pData, int nSize, uint8_t ***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = NULL, int64_t timestamp = 0, CUstream stream = 0)" uses the CUDA stream specified by ffmpeg, as we were saying in the previous emails, instead of not specifying any stream and therefore always defaulting to stream 0, the default stream. So Decode(..., ..., ..., ..., ..., ..., ..., cuda_stream).

The second option has another benefit: if the ffmpeg user specifies their own non-default stream, then this kernel joins the "overlapping world" and can overlap with any other CUDA task, saving even more time.

Hope it helps! If there are other places where cuMemcpy is called (we don't use it, but I think I saw it somewhere in the code), I think it would be nice to have the option to use a custom CUDA stream, and to keep things as they are otherwise simply by not setting a custom stream.

P.S.: I have thought of talking to NVIDIA to find out whether there is a way to not call this kernel and get whatever comes from the decoder directly, so we can transform it to the format we need; that is, calling one kernel instead of two. I'll let you know if we do, in case this becomes an option. I wonder what "uint32_t flags" is used for, though. It's not explained in the headers.

-----Original Message-----
From: ffmpeg-devel On Behalf Of Timo Rothenpieler
Sent: Monday, May 7, 2018 5:13 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Am 07.05.2018 um 17:05 schrieb Oscar Amoros Huguet:
> To clarify a bit what I was saying in the last email: when I said CUDA non-blocking streams, I meant non-default streams. All non-blocking streams are non-default streams, but non-default streams can be blocking or non-blocking with respect to the default stream. https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
>
> So, using cuMemcpyAsync would allow the memory copies to overlap with any other copy or kernel execution enqueued in any other non-default stream. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
>
> If cuStreamSynchronize has to be called right after the last cuMemcpyAsync call, I see different ways of implementing this, but you will most likely prefer the following:
>
> Add cuMemcpyAsync to the list of CUDA functions.
> Add a field in AVCUDADeviceContext of type CUstream, set to 0 (zero) by default. Let's name it "CUstream cuda_stream"?
> Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
> After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: cuStreamSynchronize(cuda_stream);
>
> If the user does not change the context and the stream, the behavior will be exactly the same as it is now. No synchronization hazards, because passing "0" as the CUDA stream makes the calls blocking, as if they weren't asynchronous calls.
>
> But, if the user wants the copies to overlap with the rest of their application, they can set their own CUDA context and their own non-default stream.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Removing the need for the memcpy itself would clearly be the best. Looking at NSIGHT, I see that NVDEC internally calls a color space transformation kernel on the default stream, and does not synchronize with the calling CPU thread. The cuMemcpy calls you have right now use the same default stream and do block the calling CPU thread, so they perform an implicit synchronization with the CPU thread.

This means that if you remove the memcpys, and the user wants to make any CUDA call over the results of this kernel, to do it safely they have two options:

1. Either they use the same default stream (which is what I'm trying to avoid here).
2. Or the NvDecoder call "bool Decode(const uint8_t *pData, int nSize, uint8_t ***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = NULL, int64_t timestamp = 0, CUstream stream = 0)" uses the CUDA stream specified by ffmpeg, as we were saying in the previous emails, instead of not specifying any stream and therefore always defaulting to stream 0, the default stream. So Decode(..., ..., ..., ..., ..., ..., ..., cuda_stream).

The second option has another benefit: if the ffmpeg user specifies their own non-default stream, then this kernel joins the "overlapping world" and can overlap with any other CUDA task, saving even more time.

Hope it helps! If there are other places where cuMemcpy is called (we don't use it, but I think I saw it somewhere in the code), I think it would be nice to have the option to use a custom CUDA stream, and to keep things as they are otherwise simply by not setting a custom stream.

P.S.: I have thought of talking to NVIDIA to find out whether there is a way to not call this kernel and get whatever comes from the decoder directly, so we can transform it to the format we need; that is, calling one kernel instead of two. I'll let you know if we do, in case this becomes an option. I wonder what "uint32_t flags" is used for, though. It's not explained in the headers.

-----Original Message-----
From: ffmpeg-devel On Behalf Of Timo Rothenpieler
Sent: Monday, May 7, 2018 5:13 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Am 07.05.2018 um 17:05 schrieb Oscar Amoros Huguet:
> To clarify a bit what I was saying in the last email: when I said CUDA non-blocking streams, I meant non-default streams. All non-blocking streams are non-default streams, but non-default streams can be blocking or non-blocking with respect to the default stream. https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
>
> So, using cuMemcpyAsync would allow the memory copies to overlap with any other copy or kernel execution enqueued in any other non-default stream. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
>
> If cuStreamSynchronize has to be called right after the last cuMemcpyAsync call, I see different ways of implementing this, but you will most likely prefer the following:
>
> Add cuMemcpyAsync to the list of CUDA functions.
> Add a field in AVCUDADeviceContext of type CUstream, set to 0 (zero) by default. Let's name it "CUstream cuda_stream"?
> Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
> After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: cuStreamSynchronize(cuda_stream);
>
> If the user does not change the context and the stream, the behavior will be exactly the same as it is now. No synchronization hazards, because passing "0" as the CUDA stream makes the calls blocking, as if they weren't asynchronous calls.
>
> But, if the user wants the copies to overlap with the rest of their application, they can set their own CUDA context and their own non-default stream.
>
> In any of the cases, ffmpeg does not have to handle CUDA stream creation and destruction, which makes it simpler.
>
> Hope you like it!

A different idea I'm looking at right now is to get rid of the memcpy entirely, turning the mapped cuvid frame into an AVFrame itself, with a buffer_ref that unmaps the cuvid frame when freeing it, instead of allocating a whole new buffer and copying it over. I'm not sure how that will play out with available free surfaces, but I will test.

I'll also add the stream basically like you described, as it seems useful to have around anyway. If the previously mentioned approach does not work, I'll implement this as described, probably for all cuMemcpy* in ffmpeg, as it at least runs the 2/3 plane copies asynchronously. Not sure if it can be changed to actually do them in parallel.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Am 07.05.2018 um 17:05 schrieb Oscar Amoros Huguet:
> To clarify a bit what I was saying in the last email: when I said CUDA non-blocking streams, I meant non-default streams. All non-blocking streams are non-default streams, but non-default streams can be blocking or non-blocking with respect to the default stream. https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
>
> So, using cuMemcpyAsync would allow the memory copies to overlap with any other copy or kernel execution enqueued in any other non-default stream. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
>
> If cuStreamSynchronize has to be called right after the last cuMemcpyAsync call, I see different ways of implementing this, but you will most likely prefer the following:
>
> Add cuMemcpyAsync to the list of CUDA functions.
> Add a field in AVCUDADeviceContext of type CUstream, set to 0 (zero) by default. Let's name it "CUstream cuda_stream"?
> Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
> After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: cuStreamSynchronize(cuda_stream);
>
> If the user does not change the context and the stream, the behavior will be exactly the same as it is now. No synchronization hazards, because passing "0" as the CUDA stream makes the calls blocking, as if they weren't asynchronous calls.
>
> But, if the user wants the copies to overlap with the rest of their application, they can set their own CUDA context and their own non-default stream.
>
> In any of the cases, ffmpeg does not have to handle CUDA stream creation and destruction, which makes it simpler.
>
> Hope you like it!

A different idea I'm looking at right now is to get rid of the memcpy entirely, turning the mapped cuvid frame into an AVFrame itself, with a buffer_ref that unmaps the cuvid frame when freeing it, instead of allocating a whole new buffer and copying it over. I'm not sure how that will play out with available free surfaces, but I will test.

I'll also add the stream basically like you described, as it seems useful to have around anyway. If the previously mentioned approach does not work, I'll implement this as described, probably for all cuMemcpy* in ffmpeg, as it at least runs the 2/3 plane copies asynchronously. Not sure if it can be changed to actually do them in parallel.
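[Editor's sketch] The "unmap on free" idea above would hinge on av_buffer_create()'s free callback. Everything FFmpeg/CUVID-specific below is stubbed or shown in comments; MappedFrame and unmap_fn are illustrative assumptions, not the actual nvdec.c internals.

```c
/* Sketch of the zero-copy idea: instead of copying the mapped cuvid
 * surface into a fresh buffer, hand it out inside an AVBufferRef whose
 * free callback unmaps it. The unmap function is a stand-in for
 * cuvidUnmapVideoFrame so this compiles without the CUVID headers. */
#include <stdlib.h>
#include <stdint.h>

typedef struct MappedFrame {
    void (*unmap_fn)(unsigned long long ptr); /* stand-in for cuvidUnmapVideoFrame */
    unsigned long long dev_ptr;               /* CUdeviceptr from cuvidMapVideoFrame */
} MappedFrame;

/* Free callback with the av_buffer_create() signature:
 * void (*free)(void *opaque, uint8_t *data) */
static void unmap_cuvid_frame(void *opaque, uint8_t *data)
{
    MappedFrame *m = opaque;
    (void)data;
    m->unmap_fn(m->dev_ptr);  /* unmap only once the last reference dies */
    free(m);
}

/* In nvdec.c this would roughly replace the copy:
 * frame->buf[0] = av_buffer_create((uint8_t *)(uintptr_t)m->dev_ptr,
 *                                  pitch * total_height,
 *                                  unmap_cuvid_frame, m,
 *                                  AV_BUFFER_FLAG_READONLY);
 */
```

The open question Timo raises still applies: the decoder's pool of mappable surfaces is finite, so holding surfaces mapped for the lifetime of each AVFrame could starve the decoder.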
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
To clarify a bit what I was saying in the last email: when I said CUDA non-blocking streams, I meant non-default streams. All non-blocking streams are non-default streams, but non-default streams can be blocking or non-blocking with respect to the default stream. https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html

So, using cuMemcpyAsync would allow the memory copies to overlap with any other copy or kernel execution enqueued in any other non-default stream. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

If cuStreamSynchronize has to be called right after the last cuMemcpyAsync call, I see different ways of implementing this, but you will most likely prefer the following:

Add cuMemcpyAsync to the list of CUDA functions.
Add a field in AVCUDADeviceContext of type CUstream, set to 0 (zero) by default. Let's name it "CUstream cuda_stream"?
Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: cuStreamSynchronize(cuda_stream);

If the user does not change the context and the stream, the behavior will be exactly the same as it is now. No synchronization hazards, because passing "0" as the CUDA stream makes the calls blocking, as if they weren't asynchronous calls.

But, if the user wants the copies to overlap with the rest of their application, they can set their own CUDA context and their own non-default stream.

In any of the cases, ffmpeg does not have to handle CUDA stream creation and destruction, which makes it simpler.

Hope you like it!

Oscar

-----Original Message-----
From: Oscar Amoros Huguet
Sent: Monday, May 7, 2018 2:05 PM
To: FFmpeg development discussions and patches
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Hi!

Even if there is a need for a synchronization before leaving the ffmpeg call, calling cuMemcpyAsync will allow the copies to overlap with any other task on the GPU that was enqueued using any other non-blocking CUDA stream. That's exactly what we want to achieve. This would automatically benefit any other app that uses non-blocking CUDA streams as independent CUDA workflows.

Oscar

Sent from my iPhone

On 7 May 2018, at 13:54, Timo Rothenpieler wrote:
>>> Additionally, could you give your opinion on the feature we also may want to add in the future, that we mentioned in the previous email? Basically, we may want to add one more CUDA function, specifically cuMemcpy2DAsync, and the possibility to set a CUstream in AVCUDADeviceContext, so it is used with cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU time) per frame when decoding 8 Full HD frames, and up to 0.5 ms (GPU time) per frame when decoding two 4K frames. This may sound like too little, but for us it is significant. Our software needs to do many things in a maximum of 33 ms with CUDA on the GPU per frame, and we have little GPU time left.
>>
>> This is interesting and I'm considering making that the default, as it would fit well with the current infrastructure, delaying the sync call to the moment the frame leaves avcodec, which with the internal re-ordering and delay should give plenty of time for the copy to finish.
>
> I'm not sure if/how well this works with the mapped cuvid frames though. The frame would already be unmapped and potentially re-used again before the async copy completes. So it would need an immediate call to Sync right after the 3 async copy calls, making the entire effort pointless.
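[Editor's sketch] The call pattern Oscar proposes, per retrieved frame, is: N asynchronous 2D copies on a configurable stream, then a single synchronize. The sketch below models the CUDA driver calls with a tiny trace recorder so the control flow is runnable anywhere; a cuda_stream field on AVCUDADeviceContext is the proposed addition, not the current API.

```c
/* Proposed nvdec_retrieve_data flow: one cuMemcpy2DAsync per plane on
 * hwctx->cuda_stream (0 by default), one cuStreamSynchronize at the end.
 * The CUDA driver calls are modelled by trace-recording stand-ins. */
#include <stdio.h>
#include <string.h>

typedef struct CUstream_st *CUstream;

static char trace[64];

static void cu_memcpy2d_async(int plane, CUstream s)  /* stand-in */
{
    char buf[16];
    (void)s;
    snprintf(buf, sizeof buf, "copy%d;", plane);
    strcat(trace, buf);
}

static void cu_stream_synchronize(CUstream s)         /* stand-in */
{
    (void)s;
    strcat(trace, "sync;");
}

static void retrieve_frame(int nb_planes, CUstream cuda_stream)
{
    /* With cuda_stream == 0 this behaves like today's blocking cuMemcpy2D
     * sequence; with a non-default stream the copies can overlap other
     * GPU work, and only the final synchronize blocks. */
    for (int p = 0; p < nb_planes; p++)
        cu_memcpy2d_async(p, cuda_stream);
    cu_stream_synchronize(cuda_stream);
}
```

The single trailing synchronize is what keeps the semantics identical for callers that never set a stream, while still letting stream-aware callers overlap the copies.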
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
>> Additionally, could you give your opinion on the feature we also may want to add in the future, that we mentioned in the previous email? Basically, we may want to add one more CUDA function, specifically cuMemcpy2DAsync, and the possibility to set a CUstream in AVCUDADeviceContext, so it is used with cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU time) per frame when decoding 8 Full HD frames, and up to 0.5 ms (GPU time) per frame when decoding two 4K frames. This may sound like too little, but for us it is significant. Our software needs to do many things in a maximum of 33 ms with CUDA on the GPU per frame, and we have little GPU time left.
>
> This is interesting and I'm considering making that the default, as it would fit well with the current infrastructure, delaying the sync call to the moment the frame leaves avcodec, which with the internal re-ordering and delay should give plenty of time for the copy to finish.

I'm not sure if/how well this works with the mapped cuvid frames though. The frame would already be unmapped and potentially re-used again before the async copy completes. So it would need an immediate call to Sync right after the 3 async copy calls, making the entire effort pointless.
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
On 26.04.2018 18:03, Oscar Amoros Huguet wrote:
> Thanks Mark,
>
> You are right, we can implement in our code a sort of "av_hwdevice_ctx_set"
> (which does not exist) by using av_hwdevice_ctx_alloc() +
> av_hwdevice_ctx_init(). We actually use av_hwdevice_ctx_alloc in our code to
> use the feature we implemented already. We are not sure about the license
> implications, though; we link dynamically to comply with the LGPL. I assume
> both calls are public, since they are not in "internal"-labelled files.
>
> We are perfectly fine with using av_hwdevice_ctx_alloc() +
> av_hwdevice_ctx_init() outside ffmpeg to use our own CUDA context. When we
> do so with the current ffmpeg code, the internal flag
> "AVCUDADeviceContextInternal.is_allocated" is not set to 1, so the CUDA
> context is not destroyed by ffmpeg in "cuda_device_uninit", which is the
> desired behavior.
>
> In fact, this flag implies that the context was not allocated by ffmpeg.
> Maybe it is the right flag to use to avoid push/pop pairs when the CUDA
> context is not created by ffmpeg. What do you think?
>
> We can adapt all of the push/pop pairs in the code to follow this policy,
> whichever flag is used.
>
> Regarding the performance impact of these push/pop calls, we have seen with
> NVIDIA's profiling tools (the NSIGHT plugin for Visual Studio) that the
> CUDA runtime detects that the context you want to set is the same as the
> one currently set, so the push call does nothing and takes 0.0016 ms on
> average (CPU time). But for some reason the cuCtxPopCurrent call does take
> more time, about 0.02 ms of CPU time per call. That is 0.16 ms total per
> frame when decoding 8 feeds. This is small, but it's easy to remove.

I'm not a fan of touching every single bit of CUDA-related code for this.
Push/pop, especially for a context that's already active, should be free.
If it's not, that's something I'd complain to nvidia about.

For your specific use case, you could build FFmpeg with a custom version of
the ffnvcodec headers that replaces the push/pop ctx functions with
effectively no-ops.

> Additionally, could you give your opinion on the feature we may also want
> to add in the future, which we mentioned in the previous email? Basically,
> we may want to add one more CUDA function, specifically cuMemcpy2DAsync,
> and the possibility to set a CUstream in AVCUDADeviceContext, so that it is
> used with cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in
> libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU
> time) per frame when decoding 8 full-HD frames, and up to 0.5 ms (GPU time)
> per frame when decoding two 4K frames. This may sound like too little, but
> for us it is significant. Our software has to do many things with CUDA on
> the GPU in a maximum of 33 ms per frame, and we have little GPU time left.

This is interesting, and I'm considering making that the default, as it
would fit well with the current infrastructure: delaying the sync call to
the moment the frame leaves avcodec, which with the internal re-ordering and
delay should give plenty of time for the copy to finish.

signature.asc
Description: OpenPGP digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
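The async-copy idea discussed above can be sketched with the CUDA driver API. This is a hedged illustration of the general technique, not the actual nvdec.c change; the function and parameter names (dev_src, host_dst, pitches, stream) are placeholders, and only cuMemcpy2DAsync / CUDA_MEMCPY2D / cuStreamSynchronize are real driver-API entities:

```c
/* Sketch: enqueue a non-blocking 2D copy on a caller-supplied stream, then
 * synchronize only at the point the frame is actually consumed. All names
 * except the CUDA API calls are illustrative placeholders. */
#include <cuda.h>

static CUresult copy_plane_async(CUdeviceptr dev_src, size_t src_pitch,
                                 void *host_dst, size_t dst_pitch,
                                 size_t width_bytes, size_t height,
                                 CUstream stream)
{
    CUDA_MEMCPY2D cpy = { 0 };

    cpy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    cpy.srcDevice     = dev_src;
    cpy.srcPitch      = src_pitch;
    cpy.dstMemoryType = CU_MEMORYTYPE_HOST;
    cpy.dstHost       = host_dst;
    cpy.dstPitch      = dst_pitch;
    cpy.WidthInBytes  = width_bytes;
    cpy.Height        = height;

    /* Returns as soon as the copy is queued; it overlaps with work
     * submitted to other streams. */
    return cuMemcpy2DAsync(&cpy, stream);
}

/* Later, when the frame leaves the decoder:
 *     cuStreamSynchronize(stream);
 */
```

Delaying the cuStreamSynchronize to the hand-over point is what gives the decoder's internal re-ordering delay time to hide the copy.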
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Thanks Mark,

You are right, we can implement in our code a sort of "av_hwdevice_ctx_set"
(which does not exist) by using av_hwdevice_ctx_alloc() +
av_hwdevice_ctx_init(). We actually use av_hwdevice_ctx_alloc in our code to
use the feature we implemented already. We are not sure about the license
implications, though; we link dynamically to comply with the LGPL. I assume
both calls are public, since they are not in "internal"-labelled files.

We are perfectly fine with using av_hwdevice_ctx_alloc() +
av_hwdevice_ctx_init() outside ffmpeg to use our own CUDA context. When we do
so with the current ffmpeg code, the internal flag
"AVCUDADeviceContextInternal.is_allocated" is not set to 1, so the CUDA
context is not destroyed by ffmpeg in "cuda_device_uninit", which is the
desired behavior.

In fact, this flag implies that the context was not allocated by ffmpeg.
Maybe it is the right flag to use to avoid push/pop pairs when the CUDA
context is not created by ffmpeg. What do you think?

We can adapt all of the push/pop pairs in the code to follow this policy,
whichever flag is used.

Regarding the performance impact of these push/pop calls, we have seen with
NVIDIA's profiling tools (the NSIGHT plugin for Visual Studio) that the CUDA
runtime detects that the context you want to set is the same as the one
currently set, so the push call does nothing and takes 0.0016 ms on average
(CPU time). But for some reason the cuCtxPopCurrent call does take more time,
about 0.02 ms of CPU time per call. That is 0.16 ms total per frame when
decoding 8 feeds. This is small, but it's easy to remove.

Additionally, could you give your opinion on the feature we may also want to
add in the future, which we mentioned in the previous email? Basically, we
may want to add one more CUDA function, specifically cuMemcpy2DAsync, and the
possibility to set a CUstream in AVCUDADeviceContext, so that it is used with
cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in
libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU time)
per frame when decoding 8 full-HD frames, and up to 0.5 ms (GPU time) per
frame when decoding two 4K frames. This may sound like too little, but for us
it is significant. Our software has to do many things with CUDA on the GPU in
a maximum of 33 ms per frame, and we have little GPU time left.

You can see all of the above in the following image: https://ibb.co/hASLZH

Thank you for your time.

Oscar
Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
On 19/04/18 17:00, Oscar Amoros Huguet wrote:
> Hi!
>
> We changed 4 files in ffmpeg: libavcodec/nvdec.c, libavutil/hwcontext.c,
> libavutil/hwcontext_cuda.h and libavutil/hwcontext_cuda.c.
>
> The purpose of this modification is very simple. We needed, for performance
> reasons (per-frame execution time), nvdec.c to use the same CUDA context as
> we use in our software.
>
> The reason for this is not so simple, and twofold:
> - We wanted to remove the overhead of the GPU constantly switching
> contexts, as we use up to 8 nvdec instances at the same time, plus a lot of
> CUDA computation.
> - For video synchronization and buffering purposes, after decoding we need
> to download the frame from GPU to CPU, but in a non-blocking manner,
> overlapped with computation and other transfers, so that the impact of the
> transfer is almost zero.
>
> To do the latter, we need to be able to synchronize our manually created
> CUDA stream with the CUDA stream being used by ffmpeg, which by default is
> the legacy default stream. To do so, we need to be in the same CUDA
> context; otherwise we don't have access to the legacy CUDA stream being
> used by ffmpeg.
>
> The consequence is that, without changing ffmpeg code, the transfer of the
> frame from GPU to CPU could not be asynchronous: if made asynchronous, it
> overlapped with the device-to-device cuMemcpy made internally by ffmpeg,
> and therefore the resulting frames were (many times) a mix of two frames.
>
> So what did we change?
>
> - Outside of the ffmpeg code, we allocate an AVBufferRef with
> av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_CUDA), and we access the associated
> AVCUDADeviceContext to set the CUDA context (cuda_ctx).
> - We modified av_hwdevice_ctx_create() in libavutil/hwcontext.c so it
> detects that the AVBufferRef being passed was allocated externally. We
> don't check that the AVHWDeviceType is AV_HWDEVICE_TYPE_CUDA. Let us know
> if you think we should check that and otherwise fall back to the default
> behavior.
> - If the AVBufferRef was already allocated, we skip the allocation call and
> pass the data as an AVHWDeviceContext to cuda_device_create.
> - We modified libavutil/hwcontext_cuda.c in several places:
> - cuda_device_create detects whether there is a CUDA context already
> present in the AVCUDADeviceContext, and if so sets the new parameter
> AVCUDADeviceContext.is_ctx_externally_allocated to 1.
> - This way, all the successive calls into this file take into account that
> ffmpeg is not responsible for the creation, thread binding/unbinding or
> destruction of the CUDA context.
> - We also skip context push and pop if the context was passed externally
> (especially in non-initialization calls), to reduce the number of calls to
> the CUDA runtime and improve the execution times of the CPU threads using
> ffmpeg.
>
> With this, we managed to have all the CUDA calls in the application in the
> same CUDA context. Also, we use the CUDA per-thread default stream, so to
> sync with the CUDA stream used by ffmpeg we only had to put the GPU-to-CPU
> copy on the globally accessible cudaStreamPerThread CUDA stream.
>
> So, of the 33 ms of available time we have per frame, we save more than
> 6 ms that were being used by the blocking copies from GPU to CPU.
>
> We considered further optimizing the code by changing ffmpeg so it can
> internally access cudaStreamPerThread and cuMemcpyAsync, so that the
> device-to-device copies are also asynchronous and overlapped with the rest
> of the computation, but the time saved is much lower, and we have other
> optimizations to do in our own code that can save more time.
>
> Nevertheless, if you find this last optimization interesting, let us know.
>
> Also, please let us know anything we did wrong or missed.

You've missed that the main feature you are adding is already present. Look
at av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init(), which uses an existing
device supplied by the user; av_hwdevice_ctx_create() is only for creating
new devices which will be managed internally.

I don't know how well the other part, eliding the context push/pop
operations, will work (someone with more Nvidia knowledge may wish to
comment on that), but it shouldn't be dependent on whether the context was
created externally. If you want to add that flag then it should probably be
called something like "single global context" to make clear what it actually
means. Also note that there are more push/pop pairs in the codebase (e.g.
for NVENC and in libavfilter), and they may all need to be updated to
respect this flag as well.

Thanks,

- Mark
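The alloc + init path Mark describes can be sketched as follows. This is a hedged illustration using the public libavutil API; error handling is minimal, and "my_cuda_ctx" stands for a CUcontext the application is assumed to have created elsewhere:

```c
/* Sketch: wrapping an externally created CUDA context in an FFmpeg hardware
 * device context via av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init(),
 * instead of letting av_hwdevice_ctx_create() make a new context. */
#include <libavutil/hwcontext.h>
#include <libavutil/hwcontext_cuda.h>

static AVBufferRef *wrap_cuda_context(CUcontext my_cuda_ctx)
{
    AVBufferRef *device_ref = av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_CUDA);
    if (!device_ref)
        return NULL;

    AVHWDeviceContext   *device_ctx = (AVHWDeviceContext *)device_ref->data;
    AVCUDADeviceContext *cuda_hwctx = device_ctx->hwctx;

    /* Hand FFmpeg the application's context. Since FFmpeg did not create
     * it, uninit will not destroy it. */
    cuda_hwctx->cuda_ctx = my_cuda_ctx;

    if (av_hwdevice_ctx_init(device_ref) < 0) {
        av_buffer_unref(&device_ref);
        return NULL;
    }
    return device_ref;
}
```

The returned AVBufferRef can then be set as the decoder's hw_device_ctx, exactly as a reference obtained from av_hwdevice_ctx_create() would be.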
[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Hi!

We changed 4 files in ffmpeg: libavcodec/nvdec.c, libavutil/hwcontext.c,
libavutil/hwcontext_cuda.h and libavutil/hwcontext_cuda.c.

The purpose of this modification is very simple. We needed, for performance
reasons (per-frame execution time), nvdec.c to use the same CUDA context as
we use in our software.

The reason for this is not so simple, and twofold:
- We wanted to remove the overhead of the GPU constantly switching contexts,
as we use up to 8 nvdec instances at the same time, plus a lot of CUDA
computation.
- For video synchronization and buffering purposes, after decoding we need
to download the frame from GPU to CPU, but in a non-blocking manner,
overlapped with computation and other transfers, so that the impact of the
transfer is almost zero.

To do the latter, we need to be able to synchronize our manually created
CUDA stream with the CUDA stream being used by ffmpeg, which by default is
the legacy default stream. To do so, we need to be in the same CUDA context;
otherwise we don't have access to the legacy CUDA stream being used by
ffmpeg.

The consequence is that, without changing ffmpeg code, the transfer of the
frame from GPU to CPU could not be asynchronous: if made asynchronous, it
overlapped with the device-to-device cuMemcpy made internally by ffmpeg, and
therefore the resulting frames were (many times) a mix of two frames.

So what did we change?

- Outside of the ffmpeg code, we allocate an AVBufferRef with
av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_CUDA), and we access the associated
AVCUDADeviceContext to set the CUDA context (cuda_ctx).
- We modified av_hwdevice_ctx_create() in libavutil/hwcontext.c so it
detects that the AVBufferRef being passed was allocated externally. We don't
check that the AVHWDeviceType is AV_HWDEVICE_TYPE_CUDA. Let us know if you
think we should check that and otherwise fall back to the default behavior.
- If the AVBufferRef was already allocated, we skip the allocation call and
pass the data as an AVHWDeviceContext to cuda_device_create.
- We modified libavutil/hwcontext_cuda.c in several places:
- cuda_device_create detects whether there is a CUDA context already present
in the AVCUDADeviceContext, and if so sets the new parameter
AVCUDADeviceContext.is_ctx_externally_allocated to 1.
- This way, all the successive calls into this file take into account that
ffmpeg is not responsible for the creation, thread binding/unbinding or
destruction of the CUDA context.
- We also skip context push and pop if the context was passed externally
(especially in non-initialization calls), to reduce the number of calls to
the CUDA runtime and improve the execution times of the CPU threads using
ffmpeg.

With this, we managed to have all the CUDA calls in the application in the
same CUDA context. Also, we use the CUDA per-thread default stream, so to
sync with the CUDA stream used by ffmpeg we only had to put the GPU-to-CPU
copy on the globally accessible cudaStreamPerThread CUDA stream.

So, of the 33 ms of available time we have per frame, we save more than 6 ms
that were being used by the blocking copies from GPU to CPU.

We considered further optimizing the code by changing ffmpeg so it can
internally access cudaStreamPerThread and cuMemcpyAsync, so that the
device-to-device copies are also asynchronous and overlapped with the rest
of the computation, but the time saved is much lower, and we have other
optimizations to do in our own code that can save more time.

Nevertheless, if you find this last optimization interesting, let us know.

Also, please let us know anything we did wrong or missed.

Thanks!
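The download path described above can be sketched with the CUDA runtime API. This is a hedged illustration of the general technique, not the authors' actual code; dev_frame, host_frame and frame_bytes are placeholders, while cudaMemcpyAsync, cudaStreamPerThread and cudaStreamSynchronize are real runtime-API names:

```c
/* Sketch: issue the GPU-to-CPU download on the per-thread default stream
 * (cudaStreamPerThread), so it serializes correctly with other work in the
 * same CUDA context while overlapping with work on other streams. */
#include <cuda_runtime.h>

static cudaError_t download_frame_async(const void *dev_frame,
                                        void *host_frame,
                                        size_t frame_bytes)
{
    /* Non-blocking copy. For real overlap, host_frame must be page-locked
     * memory (allocated with cudaMallocHost/cudaHostAlloc). */
    return cudaMemcpyAsync(host_frame, dev_frame, frame_bytes,
                           cudaMemcpyDeviceToHost, cudaStreamPerThread);
}

/* Only when the frame is actually needed on the CPU:
 *     cudaStreamSynchronize(cudaStreamPerThread);
 */
```

Note this only works as described when the copy is issued from the same CUDA context that ffmpeg decodes into, which is the point of the patch.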
---
 libavcodec/nvdec.c         | 14 +--
 libavutil/hwcontext.c      | 15 ---
 libavutil/hwcontext_cuda.c | 97 --
 libavutil/hwcontext_cuda.h |  1 +
 4 files changed, 80 insertions(+), 47 deletions(-)

diff --git a/libavcodec/nvdec.c b/libavcodec/nvdec.c
index ab3cb88..af92218 100644
--- a/libavcodec/nvdec.c
+++ b/libavcodec/nvdec.c
@@ -39,6 +39,7 @@ typedef struct NVDECDecoder {
     AVBufferRef *hw_device_ref;
     CUcontext    cuda_ctx;
+    int is_ctx_externally_allocated;
     CudaFunctions *cudl;
     CuvidFunctions *cvdl;
@@ -188,6 +189,7 @@ static int nvdec_decoder_create(AVBufferRef **out, AVBufferRef *hw_device_ref,
         goto fail;
     }
     decoder->cuda_ctx = device_hwctx->cuda_ctx;
+    decoder->is_ctx_externally_allocated = device_hwctx->is_ctx_externally_allocated;
     decoder->cudl = device_hwctx->internal->cuda_dl;
     ret = cuvid_load_functions(&decoder->cvdl, logctx);
@@ -370,9 +372,11 @@ static int nvdec_retrieve_data(void *logctx, AVFrame *frame)
     unsigned int offset = 0;
     int ret = 0;
-    err = decoder->cudl->cuCtxPushCurrent(decoder->cuda_ctx);
-    if (err != CUDA_SUCCESS)
-        return AVERROR_UNKNOWN;
+    if (!decoder->is_ctx_externally_allocated) {
+        err = decoder->cudl->cuCtxPushCu