reminisce commented on issue #16716: [Numpy] Fix collect_params().zero_grad() in gluon numpy interface

URL: https://github.com/apache/incubator-mxnet/pull/16716#issuecomment-551962930

@ptrendx Thanks for the script. I think a large part of the overhead of zeroing the ndarrays individually in Python comes from ndarray indexing, the FFI, and pushing operators to the async engine. I modified your script a little to demonstrate the point:

```python
import time

import mxnet as mx

arrays = [mx.nd.ones((100, 100), ctx=mx.gpu()) for _ in range(500)]
for a in arrays:
    a[:] = 0
num_repeats = 10

# Measure only the cost of pushing each zeroing op to the async engine.
mx.nd.waitall()
start = time.time()
for a in arrays:
    mx.nd.zeros(a.shape, out=a)
end = time.time()
print("async push per `mx.nd.zeros`: Elapsed ", (end - start) / len(arrays))

# End-to-end: push the ops and wait for all kernels to finish.
mx.nd.waitall()
start = time.time()
for _ in range(num_repeats):
    for a in arrays:
        mx.nd.zeros(a.shape, out=a)
mx.nd.waitall()
end = time.time()
print("normal: Elapsed ", (end - start))

# Same, but bulk the ops to reduce per-op engine overhead.
mx.nd.waitall()
start = time.time()
for _ in range(num_repeats):
    with mx.engine.bulk(len(arrays)):
        for a in arrays:
            mx.nd.zeros(a.shape, out=a)
mx.nd.waitall()
end = time.time()
print("bulk: Elapsed ", (end - start))

# Async push cost of a single `reset_arrays` call covering all arrays.
mx.nd.waitall()
start = time.time()
for _ in range(100):
    mx.nd.reset_arrays(*arrays, num_arrays=len(arrays))
end = time.time()
print("async push per `reset_arrays`: Elapsed ", (end - start) / 100)

# End-to-end with `reset_arrays`.
mx.nd.waitall()
start = time.time()
for _ in range(num_repeats):
    mx.nd.reset_arrays(*arrays, num_arrays=len(arrays))
mx.nd.waitall()
end = time.time()
print("reset_arrays: Elapsed ", (end - start))
```

and got these results:

```
async push per `mx.nd.zeros`: Elapsed 7.888364791870118e-05
normal: Elapsed 0.3912644386291504
bulk: Elapsed 0.3276066780090332
async push per `reset_arrays`: Elapsed 0.0005680346488952637
reset_arrays: Elapsed 0.019466638565063477
```

If you calculate the overhead of pushing the zeroing of 500 ndarrays with 10 repeats (roughly excluding the kernel execution time), it's `8.108711242675781e-05 * 500 * 10 = 0.40543556213378906` seconds. This is just an estimate, but it shows how significant the accumulated overhead of invoking operators is for small ops. I agree that in this situation we should keep `reset_arrays` as an intermediate solution to keep the performance on par, and we will continue to optimize the latency of invoking operators.
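As a quick sanity check on that arithmetic (plain Python, no MXNet needed), the accumulated push overhead is just the per-call figure quoted above multiplied by the number of arrays and repeats:

```python
# Back-of-the-envelope estimate of the accumulated operator-invocation
# overhead, using the per-call figure quoted in the comment above.
per_call_overhead_s = 8.108711242675781e-05  # seconds per pushed op
num_arrays = 500
num_repeats = 10

total_overhead_s = per_call_overhead_s * num_arrays * num_repeats
print(f"estimated accumulated overhead: {total_overhead_s:.6f} s")
# ~0.405 s of pure invocation overhead, versus ~0.019 s measured
# end-to-end for `reset_arrays` above.
```

This comparison is why batching the zeroing into a single `reset_arrays` call pays off: the kernels themselves are cheap, and almost all of the time goes into launching 5000 tiny ops one at a time.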
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services
