bputrycz commented on issue #13530: Integrate MKLDNN Conv1d and support 3d 
layout
URL: https://github.com/apache/incubator-mxnet/pull/13530#issuecomment-451302666
 
 
   I noticed a very nice improvement with this change, so thank you.
   
   Still, for my use case (conv1d with a small batch size, a small channel dimension, and a long sequence length) I don't see much improvement as more cores are added to the computation.
   
   The simplified snippet to reproduce (conv1d.py):
   ```python
   import mxnet as mx
   from mxnet import gluon, nd
   
   from mxnet import profiler
   profiler.set_config(profile_all=True, aggregate_stats=True, 
filename='profile_output.json')
    
   channels = 64
   
   net = gluon.nn.Sequential()
   
   # Note: the same conv/act instances are added three times, so their
   # parameters are shared; that doesn't matter for this timing experiment.
   conv = gluon.nn.Conv1D(channels, 4, padding=1)
   act = gluon.nn.Activation('sigmoid')
   
   for i in range(3):
       net.add(conv)
       net.add(act)
   
   net.initialize()
   
   data = nd.random.uniform(shape=(1, channels, 2**16))
   
   # Warm-up
   y = net(data)
   nd.waitall()
   
   profiler.set_state('run')
   for i in range(10):
       y = net(data)
       nd.waitall()
   profiler.set_state('stop')
   
   print(profiler.dumps())
   ```
   When run on a host with many cores (an AWS c4.8xlarge), this results in:
   ```
   $ OMP_NUM_THREADS=1 python conv1d.py | grep "Convolution\|Activation"
   Activation                             60         244.8700           3.3960           9.7510           4.0812
   Convolution                            60        1648.7010          24.9280          40.6520          27.4783
   $ OMP_NUM_THREADS=2 python conv1d.py | grep "Convolution\|Activation"
   Activation                             60         127.4460           1.6600           5.4070           2.1241
   Convolution                            60         866.8680          12.6670          22.9810          14.4478
   $ OMP_NUM_THREADS=4 python conv1d.py | grep "Convolution\|Activation"
   Activation                             60          65.3190           0.8940           2.9280           1.0886
   Convolution                            60         854.2900          12.6230          20.3230          14.2382
   ```
   
   There is no improvement when the number of threads is increased to 4 or more: the Convolution total time drops from ~1649 ms with 1 thread to ~867 ms with 2 threads (~1.9x speedup), but only to ~854 ms with 4 threads.
   
   Playing more with this example, for higher 'channels' values it becomes somewhat better parallelizable.
   So it seems the parallelization is done only within a single sequence "point" (i.e. across channels). Is that the case?
   But it seems quite natural to parallelize along the sequence as well, especially when it is long: different threads handling different parts of the sequence.
   Then the parallelization should scale roughly linearly with the number of threads.
   Isn't it done like that?
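To illustrate what I mean, here is a NumPy sketch of the idea (not how MKL-DNN actually implements convolution): a 1-D "valid" convolution can be split along the sequence axis by giving each chunk a halo of `kernel_size - 1` extra input samples. The chunks are then fully independent, so they could be handled by separate threads; `conv1d_valid` and `conv1d_chunked` are just illustrative names.

```python
import numpy as np

def conv1d_valid(x, w):
    # np.convolve flips the kernel, so flip it back to get a
    # cross-correlation-style convolution like Conv1D computes.
    return np.convolve(x, w[::-1], mode='valid')

def conv1d_chunked(x, w, n_chunks):
    # Split the OUTPUT range into n_chunks pieces; the chunk producing
    # outputs [lo, hi) needs inputs [lo, hi + k - 1), i.e. a halo of
    # k - 1 samples. Each piece is independent of the others.
    k = len(w)
    out_len = len(x) - k + 1
    bounds = np.linspace(0, out_len, n_chunks + 1, dtype=int)
    parts = [conv1d_valid(x[lo:hi + k - 1], w)
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
x = rng.standard_normal(2**16)   # long sequence, as in the snippet above
w = rng.standard_normal(4)       # kernel size 4, as in the snippet above

full = conv1d_valid(x, w)
chunked = conv1d_chunked(x, w, n_chunks=4)
assert np.allclose(full, chunked)
```

Since the per-chunk results concatenate exactly to the full result, each chunk could go to a different OpenMP thread, which is why I would expect near-linear scaling for long sequences.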
   
   Bartosz
   
