Hi,

Do we have a conclusion on whether this patch can be pushed in?
Thanks,
Ben

On Tue, Jan 30, 2018 at 4:25 PM, Ben Chang <benchang...@gmail.com> wrote:
>
> On Fri, Jan 26, 2018 at 3:10 PM, Mark Thompson <s...@jkqxz.net> wrote:
>
>> On 26/01/18 20:51, Ben Chang wrote:
>> > On Fri, Jan 26, 2018 at 3:32 AM, Mark Thompson <s...@jkqxz.net> wrote:
>> >
>> >> On 26/01/18 09:06, Ben Chang wrote:
>> >>> Thanks for the review, Mark.
>> >>>
>> To clarify, since it is less clear now with the trimmed context: my two
>> comments about this change are completely independent. (Given that, maybe
>> it should be split into two parts - one for hwcontext and one for nvenc?)
>
> Sorry for the delay in replying, Mark; I have been caught up with
> something else.
>
>> This part is about the change to NVENC:
>>
>> >>> There are some cuda kernels in the driver that may be invoked
>> >>> depending on the nvenc operations specified on the command line. My
>> >>> observation from looking at the nvcc statistics is that the stack
>> >>> frame size for most of these cuda kernels is 0 (the highest observed
>> >>> was 120 bytes).
>> >>
>> >> Right, that makes sense. If Nvidia is happy that this will always work
>> >> in drivers compatible with this API version (including any future
>> >> ones) then sure.
>> >
>> > I am not saying this should be the "permanent" value for the stack
>> > frame size per GPU thread. However, at this moment (looking at the
>> > existing cuda kernels that devs have control over), I do not see this
>> > reduction as an issue.
>>
>> I think you should be confident that the chosen value here will last well
>> into the future for NVENC use. Consider that this will end up in releases
>> - if a future Nvidia driver update happens to need a larger stack then
>> all previous releases and binaries will stop working for all users.
>>
>> This part is about the change to the hwcontext device creation:
>>
>> >>>> This is technically a user-visible change, since it will apply to
>> >>>> all user programs run on the CUDA context created here as well as
>> >>>> those inside ffmpeg. I'm not sure how many people actually use that,
>> >>>> though, so maybe it won't affect anyone.
>> >>>
>> >>> In ffmpeg, I see vf_thumbnail_cuda and vf_scale_cuda available (not
>> >>> sure if there are more, but these two should not be affected by this
>> >>> reduction). Users can always raise the stack limit if their own
>> >>> custom kernels require a larger stack frame size.
>> >>
>> >> I don't mean filters inside ffmpeg, I mean a user program which
>> >> probably uses NVDEC and/or NVENC (and possibly other things) from
>> >> libavcodec but then does its own CUDA processing with the same
>> >> context. This is silently changing the setup underneath it, and 128
>> >> feels like a very small number.
>> >
>> > Yes, this is really a trade-off between reducing memory usage (since
>> > there are numerous complaints of high memory usage preventing running
>> > more ffmpeg instances) and user convenience (custom cuda
>> > implementations may be impacted). My thought (which may be wrong) is
>> > that users who implement their own cuda kernels have better knowledge
>> > of cuda (e.g. how much stack frame size their kernel needs, or how to
>> > use the cuda debugger to find out what issue they may have). The size
>> > of a kernel's stack frame is really implementation dependent (e.g.
>> > allocating arrays on the stack or the heap, recursion, how much
>> > register spilling, etc.), so stack frame sizes may vary widely. The
>> > default, 1024 bytes, may not be enough at times and the user will need
>> > to adjust the stack limit accordingly anyway.
>>
>> Note that since you are changing a library, the users in this context are
>> all ffmpeg library users. So, it means any program or library which uses
>> ffmpeg, and transitively anyone who uses them. The end-user need not be
>> informed about CUDA at all.
>>
>> (What you've said also makes it sound like it can change by compiler
>> version, but I guess such changes should be small.)
>>
>> >>>> If the stack limit is violated, what happens? Will that be undefined
>> >>>> behaviour with random effects (crash / incorrect results), or is it
>> >>>> likely to be caught at program compile/load-time?
>> >>>
>> >>> The stack will likely overflow and the kernel will terminate (though
>> >>> I have yet to encounter this).
>> >>
>> >> As long as the user gets a clear message that a stack overflow has
>> >> occurred so that they can realise that they need to raise the value
>> >> then it should be fine.
>> >
>> > I believe you will see the stack overflow if attached to the cuda
>> > debugger, but the default error may just be a kernel launch
>> > error/failure. This goes back to my opinion that a cuda developer
>> > should be able to figure this out relatively easily if they want to
>> > customize the cuda part of their program.
>>
>> Ok, sure. It's possibly unfortunate if a user who knows nothing about
>> CUDA rebuilds a program from source with newer ffmpeg libraries including
>> this change, because something could stop working in an opaque way, but
>> if it errors out then it won't silently give the wrong answer and should
>> be diagnosable by the original developer.
>
> I have asked my colleagues about the effects of a cuda kernel's stack
> frame exceeding the default stack limit size. The cuda driver should
> automatically increase the stack limit / allocate more memory. It will
> only fail if there is no more memory available.
>
> I have done some experiments to validate this (by creating arrays of
> various sizes on the stack in a cuda kernel), with cuCtxSetLimit setting a
> 128-byte stack limit in ffmpeg. A stack frame size of about 40000 bytes in
> the kernel does not result in failure (the extra memory allocated by the
> driver is visible in GPU-Z). Only when I use something ridiculous like a
> 300,000-byte stack frame does the encode fail with a kernel launch
> failure.
>
> Therefore, I do not think this patch reducing the default stack limit will
> cause regressions for users (in either nvenc or hwcontext device
> creation).
>
> In short, I am trying to say there is a fallback if the initial stack
> limit is not enough. This fallback only fails if there is no more memory
> available on the gpu (in which case the user would run into the problem
> even with the default 1024 bytes).
>
> Thanks,
> Ben
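For anyone following along, here is a minimal standalone sketch of the call
being discussed: creating a CUDA context, shrinking its per-thread stack limit
with cuCtxSetLimit(CU_LIMIT_STACK_SIZE, ...), and reading back what the driver
reports. It is illustrative only - ffmpeg loads these driver entry points
dynamically through its dynlink headers rather than linking cuda.h directly,
and the 128-byte figure is simply the value from the patch under discussion.

/* Illustrative sketch only: set a reduced per-thread stack limit on a new
 * CUDA context, then query what the driver reports back.
 * Build against the CUDA driver API, e.g.:  cc stack_limit.c -lcuda
 * (ffmpeg itself reaches these entry points via dynamic loading rather than
 * linking cuda.h directly.) */
#include <stdio.h>
#include <cuda.h>

#define CHECK(call)                                             \
    do {                                                        \
        CUresult err = (call);                                  \
        if (err != CUDA_SUCCESS) {                              \
            const char *str = NULL;                             \
            cuGetErrorString(err, &str);                        \
            fprintf(stderr, "%s failed: %s\n", #call,           \
                    str ? str : "unknown error");               \
            return 1;                                           \
        }                                                       \
    } while (0)

int main(void)
{
    CUdevice  dev;
    CUcontext ctx;
    size_t    stack_size;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* The default is typically 1024 bytes per GPU thread; the patch under
     * discussion lowers it to 128 to cut per-context memory usage. */
    CHECK(cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 128));

    /* Read back the limit actually applied by the driver. */
    CHECK(cuCtxGetLimit(&stack_size, CU_LIMIT_STACK_SIZE));
    printf("per-thread stack limit: %zu bytes\n", stack_size);

    CHECK(cuCtxDestroy(ctx));
    return 0;
}

A library user who runs their own kernels on the same context can call
cuCtxSetLimit again with a larger value if their kernels need more stack.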