ptrendx commented on pull request #18622:
URL: https://github.com/apache/incubator-mxnet/pull/18622#issuecomment-670771199


   @eric-haibin-lin Yes. The overhead comes from preparing a string with the 
kernel options (such as the datatypes) and looking the kernel function up in a 
cache. A CUDA graph caches the resulting function, so the lookup no longer 
occurs on replay.
   
   That said, this overhead is lower than the overhead of `cudaLaunchKernel` 
itself and is barely noticeable. I tried a worst-case scenario - a fully 
hybridized model adding single-element tensors (to be 100% CPU-limited) - and 
measured a ~10% slowdown. A more realistic workload, with kernels taking longer 
than a few µs, would not show any difference. The same CPU-limited test with a 
non-hybridized model showed no noticeable slowdown (the overheads of imperative 
mode are much higher than this).

