reminisce commented on issue #16716: [Numpy] Fix collect_params().zero_grad() 
in gluon numpy interface
URL: https://github.com/apache/incubator-mxnet/pull/16716#issuecomment-551962930
 
 
   @ptrendx Thanks for the script. I think a large part of overhead for zeroing 
ndarrays individually in Python comes from ndarray indexing, FFI, and pushing 
operators to the async engine. I modified your script a little bit to 
demonstrate the point.
   ```python
   import mxnet as mx
   import time

   arrays = [mx.nd.ones((100, 100), ctx=mx.gpu()) for _ in range(500)]

   # Warm up: touch every array once.
   for a in arrays:
       a[:] = 0

   num_repeats = 10

   # 1. Async-push cost per `mx.nd.zeros` call: no waitall before `end`,
   #    so this measures only the time to enqueue the ops on the engine.
   mx.nd.waitall()
   start = time.time()
   for a in arrays:
       mx.nd.zeros(a.shape, out=a)
   end = time.time()
   print("async push per `mx.nd.zeros`: Elapsed ", (end - start) / len(arrays))

   # 2. Full wall-clock time for zeroing each array individually.
   mx.nd.waitall()
   start = time.time()
   for _ in range(num_repeats):
       for a in arrays:
           mx.nd.zeros(a.shape, out=a)
   mx.nd.waitall()
   end = time.time()
   print("normal: Elapsed ", (end - start))

   # 3. Same as above, but with engine bulking enabled.
   mx.nd.waitall()
   start = time.time()
   for _ in range(num_repeats):
       with mx.engine.bulk(len(arrays)):
           for a in arrays:
               mx.nd.zeros(a.shape, out=a)
   mx.nd.waitall()
   end = time.time()
   print("bulk: Elapsed ", (end - start))

   # 4. Async-push cost per `reset_arrays` call (one op zeroing all arrays).
   mx.nd.waitall()
   start = time.time()
   for _ in range(100):
       mx.nd.reset_arrays(*arrays, num_arrays=len(arrays))
   end = time.time()
   print("async push per `reset_arrays`: Elapsed ", (end - start) / 100)

   # 5. Full wall-clock time for `reset_arrays`.
   mx.nd.waitall()
   start = time.time()
   for _ in range(num_repeats):
       mx.nd.reset_arrays(*arrays, num_arrays=len(arrays))
   mx.nd.waitall()
   end = time.time()
   print("reset_arrays: Elapsed ", (end - start))
   ```
   and got these results:
   ```
   async push per `mx.nd.zeros`: Elapsed  7.888364791870118e-05
   normal: Elapsed  0.3912644386291504
   bulk: Elapsed  0.3276066780090332
   async push per `reset_arrays`: Elapsed  0.0005680346488952637
   reset_arrays: Elapsed  0.019466638565063477
   ```
   If you estimate the overhead of invoking the zeroing of 500 ndarrays repeated 10 times (roughly excluding the kernel execution time), it comes to `8.108711242675781e-05 * 500 * 10 = 0.40543556213378906` seconds. This is only an estimate, but it shows how significant the accumulated overhead of invoking operators is for small ops.
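   For concreteness, the back-of-the-envelope arithmetic can be checked directly (the per-call figure is the estimate quoted above, not a new measurement):

   ```python
   # Estimated per-call invocation overhead quoted above (seconds).
   per_call_overhead = 8.108711242675781e-05

   num_arrays = 500
   num_repeats = 10

   # Total overhead accumulated across all individual zeroing calls.
   total_overhead = per_call_overhead * num_arrays * num_repeats
   print(total_overhead)  # roughly 0.405 seconds
   ```

   That accumulated launch overhead alone already exceeds the `reset_arrays` total measured above, which is why a single batched op wins here.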
   
   I agree that in this situation we should keep `reset_arrays` as an interim solution to keep performance on par, and we will continue to optimize the latency of invoking operators.
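   
   As a hypothetical illustration (not the actual Gluon implementation), a batched `zero_grad` could first group gradient arrays by device and then issue one `reset_arrays`-style call per context. The sketch below uses plain Python stand-ins for NDArrays; the `context` attribute and the `group_by_context` helper are assumptions for demonstration only:

   ```python
   from collections import defaultdict

   class FakeArray:
       """Stand-in for an NDArray; only carries a context label."""
       def __init__(self, context):
           self.context = context

   def group_by_context(arrays):
       """Group arrays by device so each device can receive a single
       batched zeroing call instead of one call per array."""
       groups = defaultdict(list)
       for a in arrays:
           groups[a.context].append(a)
       return dict(groups)

   arrays = [FakeArray("gpu(0)") for _ in range(3)] + [FakeArray("cpu(0)")]
   groups = group_by_context(arrays)
   # groups has one entry per context; each entry would get one batched call.
   ```

   Grouping per context matters because a single batched op can only touch arrays on one device at a time.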
