aseyboldt commented on issue #8219: Broadcasting ops are slow
URL: 
https://github.com/apache/incubator-mxnet/issues/8219#issuecomment-349621811
 
 
   @cjolivier01 Thanks for looking into this.
   
   I haven't updated to mxnet 1.0 yet, so it is possible that this is fixed now 
(I only have slow internet at the moment so I can't update). Looking at the 
code I don't think so however.
   
   A broadcasting `broadcast_add(array, array)` shouldn't be much slower than a 
plain elementwise `array + array`, especially when one of the arrays is much 
smaller than the other, since the smaller operand reduces the total memory 
traffic. Memory bandwidth should be the limiting factor for simple ops on large 
arrays. This can be seen when we compare to numpy:
   
   ```python
   import os
   os.environ['OMP_NUM_THREADS'] = '1'
   
   import numpy as np
   import mxnet as mx
   import time
   
   a = mx.sym.var('a')
   b = mx.sym.var('b')
   
   a_ = mx.nd.ones((2**17, 10, 10))
   b_ = mx.nd.ones((1,))
   c_ = a_.copy()
   
   x = a_.asnumpy()
   y = b_.asnumpy()
   z = c_.asnumpy()
   
   func1 = (a + b).bind(mx.cpu(), {'a': a_, 'b': c_})
   func2 = mx.sym.broadcast_add(a, b).bind(mx.cpu(), {'a': a_, 'b': b_})
   
   for _ in range(2):
       # elemwise
       start = time.time()
       for i in range(100):
           func1.forward()[0].wait_to_read()
       print("func1: {}".format(time.time() - start))
   
   
        # broadcast_add(array, array)
       start = time.time()
       for i in range(100):
           func2.forward()[0].wait_to_read()
       print("func2: {}".format(time.time() - start))
   
       # numpy elemwise
       start = time.time()
       out = np.zeros_like(x)
       for i in range(100):
           np.add(x, z, out=out)
       print("numpy1: {}".format(time.time() - start))
       
       # numpy broadcast
       start = time.time()
       for i in range(100):
           np.add(x, y, out=out)
       print("numpy2: {}".format(time.time() - start))
       
       print()
   ```
   
   which gives me (different machine than the last benchmark)
   ```
   func1: 0.9796142578125
   func2: 9.832738876342773
   numpy1: 0.9367139339447021
   numpy2: 0.6408178806304932
   
   func1: 0.927008867263794
   func2: 10.026437997817993
   numpy1: 1.091845989227295
   numpy2: 0.646554708480835
   ```
   
   For numpy the broadcasting op is *faster* than the normal one; for mxnet it 
is about 10x slower.
   
   In the non-broadcasting case both numpy and mxnet are bound by memory 
bandwidth, and this is still more or less true in the broadcasting case for 
numpy, but not for mxnet. The slowdown seems to affect broadcasting ops in 
mxnet in general, not only when a scalar is added. (Numpy can't saturate the 
memory bandwidth in some cases either, but it never slows down nearly as much 
as mxnet.)
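   As an aside, part of why numpy stays fast here is that it implements 
broadcasting with zero strides: the broadcast operand becomes a view that 
re-reads the same cached memory, so the inner loop needs no per-element index 
computation. A quick way to see this (just an illustration of numpy's 
mechanism, not a statement about mxnet's internals):

```python
import numpy as np

x = np.ones((2**17, 10, 10))
y = np.ones((1,))

# numpy broadcasts by creating a view whose strides are 0 along the
# broadcast axes, so the inner loop keeps re-reading the same element
# from cache instead of recomputing multi-dimensional indices.
y_view = np.broadcast_to(y, x.shape)

print(y_view.shape)    # same logical shape as x
print(y_view.strides)  # all zeros: no extra memory is traversed
```

   Since the view touches only one element of `y`, adding it to `x` reads and 
writes barely more memory than the elementwise case, which fits the numpy2 
timing above.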
   
   My guess as to why func2 is so much slower than func1 is that the index 
juggling in `ravel` and `unravel` takes time and defeats prefetching. Other 
explanations could be that some array is traversed in the wrong order (though I 
don't think that is the case), or that the branch introduced by `addto` slows 
things down (though I don't see how that alone would account for a factor of 
10).
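   To make the suspected pattern concrete, here is a pure-Python sketch of the 
per-element index arithmetic a generic broadcast kernel performs (an 
illustration of the pattern I mean, not mxnet's actual code; inputs are 
assumed to already have the same rank):

```python
import numpy as np

def broadcast_add_unravel(a, b):
    # Output shape: per-axis maximum (size-1 axes broadcast).
    out_shape = tuple(max(m, n) for m, n in zip(a.shape, b.shape))
    out = np.empty(out_shape)
    for flat in range(out.size):
        # "unravel": flat index -> multi-index, one div/mod per axis
        idx = []
        rem = flat
        for dim in reversed(out_shape):
            idx.append(rem % dim)
            rem //= dim
        idx = tuple(reversed(idx))
        # "ravel": map the multi-index back into each input, clamping
        # broadcast (size-1) axes to 0
        a_idx = tuple(i if s > 1 else 0 for i, s in zip(idx, a.shape))
        b_idx = tuple(i if s > 1 else 0 for i, s in zip(idx, b.shape))
        out[idx] = a[a_idx] + b[b_idx]
    return out

a = np.random.rand(4, 3, 3)
b = np.random.rand(1, 3, 3)
assert np.allclose(broadcast_add_unravel(a, b), a + b)
```

   Every output element pays a chain of integer divisions and modulos, where 
an elemwise kernel just increments a pointer; that kind of index arithmetic in 
the hot loop is the sort of thing I'd expect to cost well over the memory 
bandwidth limit.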
