reminisce commented on issue #16716: [Numpy] Fix collect_params().zero_grad() 
in gluon numpy interface
URL: https://github.com/apache/incubator-mxnet/pull/16716#issuecomment-551241818
 
 
   @ptrendx The performance overhead in your benchmark really comes from 
the FFI and from pushing ops to the async engine. It becomes more obvious when 
the kernel execution time is negligible. We are working on reducing the operator 
calling overhead. Apart from that, purely from code analysis, `reset_arrays` 
requires all input ndarrays to be write-ready to run in the same CUDA stream, 
while the other way has no such restriction. Your benchmark case might be 
a special one that favors `reset_arrays`, but this op is exposed as a public API 
and we cannot prevent users from using it in other cases.
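
To make the trade-off concrete, here is a rough toy timing model, not MXNet code. The function names, the integer time units, and the simplifications are all assumptions for illustration: the call overhead (FFI plus engine push) is paid serially on the frontend, and the per-array kernels are not serialized against each other.

```python
# Toy timing model (hypothetical; not MXNet code). Times are arbitrary integer units.

def fused_finish_times(ready, overhead, kernel):
    """One fused op over all arrays: pays the call overhead once, but cannot
    start until every array is write-ready."""
    start = max(overhead, max(ready))
    done = start + kernel * len(ready)
    return [done] * len(ready)


def per_array_finish_times(ready, overhead, kernel):
    """One op per array: pays the call overhead for every op (serially, on the
    frontend), but each op waits only for its own array."""
    finish = []
    pushed = 0
    for r in ready:
        pushed += overhead  # cost of calling and pushing this op
        finish.append(max(pushed, r) + kernel)
    return finish


n, overhead, kernel = 100, 10, 1

all_ready = [0] * n                  # every array is write-ready immediately
one_late = [2000] + [0] * (n - 1)    # one array is still being written to

fused_a = fused_finish_times(all_ready, overhead, kernel)
split_a = per_array_finish_times(all_ready, overhead, kernel)
fused_b = fused_finish_times(one_late, overhead, kernel)
split_b = per_array_finish_times(one_late, overhead, kernel)

# When everything is ready, the fused op wins on call overhead.
print(max(fused_a), max(split_a))  # 110 1001
# With one late array, count how many arrays are zeroed before t=2000.
print(sum(t < 2000 for t in fused_b), sum(t < 2000 for t in split_b))  # 0 99
```

In this model the fused call is clearly faster when all arrays are already write-ready, which matches the benchmark case, while a single late array holds back every array in the fused call but only itself in the per-array version.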
   
   > sync at the end is 10+x that.
   
   I'm not sure which `sync` you are referring to here. Could you explain?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
