bputrycz commented on issue #13530: Integrate MKLDNN Conv1d and support 3d layout
URL: https://github.com/apache/incubator-mxnet/pull/13530#issuecomment-451302666

I noticed a very nice improvement with this change, so thank you. Still, for my use case (conv1d, small batch size, small channel dimension, long sequence length) I don't see much improvement as more cores are added to the computation.

A simplified snippet to reproduce (conv1d.py):

```
import mxnet as mx
from mxnet import gluon, nd
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='profile_output.json')

channels = 64

net = gluon.nn.Sequential()
conv = gluon.nn.Conv1D(channels, 4, padding=1)
act = gluon.nn.Activation('sigmoid')
for i in range(3):
    # Note: the same Conv1D block (shared weights) is added three times
    net.add(conv)
    net.add(act)
net.initialize()

data = nd.random.uniform(shape=(1, channels, 2**16))

# Warm-up
y = net(data)
nd.waitall()

profiler.set_state('run')
for i in range(10):
    y = net(data)
nd.waitall()
profiler.set_state('stop')

print(profiler.dumps())
```

Running on a host with many cores (AWS c4.8xlarge) gives (columns: operator, call count, total / min / max / avg time in ms):

```
$ OMP_NUM_THREADS=1 python conv1d.py | grep "Convolution\|Activation"
Activation      60    244.8700    3.3960    9.7510    4.0812
Convolution     60   1648.7010   24.9280   40.6520   27.4783
$ OMP_NUM_THREADS=2 python conv1d.py | grep "Convolution\|Activation"
Activation      60    127.4460    1.6600    5.4070    2.1241
Convolution     60    866.8680   12.6670   22.9810   14.4478
$ OMP_NUM_THREADS=4 python conv1d.py | grep "Convolution\|Activation"
Activation      60     65.3190    0.8940    2.9280    1.0886
Convolution     60    854.2900   12.6230   20.3230   14.2382
```

Going from 2 to 4 (or more) threads brings no further improvement for Convolution. Playing more with this example, higher 'channels' values parallelize somewhat better, so it seems the parallelization is done only within a single sequence "point". Is that the case?
But it seems quite natural to also parallelize along the sequence, especially when it is long: different threads would each handle a different part of the sequence, and the parallelization should then scale roughly linearly with the number of threads. Isn't it done like that?

Bartosz
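To illustrate the idea (this is not MXNet's or MKL-DNN's actual implementation): a 1-D convolution can be split along the sequence axis into independent chunks, each padded with a halo of `kernel_size - 1` input samples, so that the chunks could be computed on separate threads and concatenated to reproduce the monolithic result. A minimal NumPy sketch, where `chunked_conv1d` and `n_chunks` are hypothetical names:

```python
import numpy as np

def chunked_conv1d(x, w, n_chunks):
    """'Valid' 1-D convolution computed independently on n_chunks
    slices of x, then concatenated. Each slice carries a halo of
    len(w) - 1 extra input samples so chunk boundaries match the
    monolithic result; each slice could run on its own thread."""
    k = len(w)
    out_len = len(x) - k + 1
    # Split the output positions into n_chunks contiguous ranges
    bounds = np.linspace(0, out_len, n_chunks + 1, dtype=int)
    parts = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Output positions [lo, hi) depend on inputs [lo, hi + k - 1)
        parts.append(np.convolve(x[lo:hi + k - 1], w, mode='valid'))
    return np.concatenate(parts)

# The chunked result matches the single-pass convolution exactly
x = np.random.rand(2**16)
w = np.random.rand(4)
assert np.allclose(np.convolve(x, w, mode='valid'),
                   chunked_conv1d(x, w, n_chunks=8))
```

Since the chunks share no state, this decomposition would in principle scale with the number of threads for long sequences, at the cost of recomputing the small halo regions.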
