reminisce commented on issue #16716: [Numpy] Fix collect_params().zero_grad() in gluon numpy interface
URL: https://github.com/apache/incubator-mxnet/pull/16716#issuecomment-551241818

@ptrendx The performance overhead in your benchmark really comes from the FFI and from pushing ops to the async engine. It becomes more obvious when the kernel execution time is negligible. We are working on reducing the operator calling overhead.

Apart from that, from pure code analysis, `reset_arrays` requires all input ndarrays to be write-ready before it can run in the same CUDA stream, while the other approach has no such restriction. Your benchmark may be a special case that favors `reset_arrays`, but this op is exposed as a public API and we cannot prevent users from using it in other cases.

> sync at the end is 10+x that.

I'm not sure which `sync` you are referring to here. Could you explain?
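
To make the trade-off above concrete, here is a minimal toy cost model in plain Python. It is only a sketch and does not use MXNet or model its engine: the function names, the fixed per-op dispatch overhead, the per-array kernel time, and the assumption that independent kernels never contend with each other are all hypothetical simplifications. It compares one fused op that must wait for every array to be write-ready against one op per array.

```python
def fused_finish_times(ready_times, dispatch_overhead, kernel_time_per_array):
    """Toy model of a single fused reset op (hypothetical, not MXNet code).

    The op pays the dispatch overhead once, but cannot start until
    every array is write-ready.
    """
    start = max(ready_times) + dispatch_overhead
    finish = start + kernel_time_per_array * len(ready_times)
    # All arrays become usable only when the fused op completes.
    return [finish] * len(ready_times)


def per_array_finish_times(ready_times, dispatch_overhead, kernel_time_per_array):
    """Toy model of one reset op per array (hypothetical, not MXNet code).

    The front end pushes ops one after another, paying the dispatch
    overhead each time. Each op runs as soon as it has been pushed and
    its own array is write-ready. Assumption: kernels do not contend.
    """
    finish = []
    pushed_at = 0
    for ready in ready_times:
        pushed_at += dispatch_overhead
        start = max(pushed_at, ready)
        finish.append(start + kernel_time_per_array)
    return finish


if __name__ == "__main__":
    # Case A: many small arrays, all ready immediately.
    # Dispatch overhead dominates, so the fused op finishes far earlier.
    ready_a = [0] * 100
    print(max(fused_finish_times(ready_a, 10, 1)))      # 110
    print(max(per_array_finish_times(ready_a, 10, 1)))  # 1001

    # Case B: one array becomes write-ready late.
    # The fused op holds back every array until the late one is ready.
    ready_b = [0, 0, 0, 500]
    print(fused_finish_times(ready_b, 10, 1))      # [514, 514, 514, 514]
    print(per_array_finish_times(ready_b, 10, 1))  # [11, 21, 31, 501]
```

In case A the fused op wins because the per-call overhead is paid once instead of 100 times. In case B the per-array version makes the first three arrays available long before the fused op could even start, which is the restriction described above. The time units and numbers are arbitrary.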
