[GitHub] [incubator-mxnet] Neutron3529 opened a new issue #15655: Performance regression for gluon dataloader with large batch size

GitBox Wed, 24 Jul 2019 23:26:11 -0700

Neutron3529 opened a new issue #15655: Performance regression for gluon dataloader with large batch size
URL: https://github.com/apache/incubator-mxnet/issues/15655
 
 
   ## Description
   
   Gluon's `DataLoader` performs terribly compared to `mx.io.NDArrayIter` when the batch size is large.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.6.6
   Compiler     : MSC v.1900 64 bit (AMD64)
   Build        : ('v3.6.6:4cf1f54eb7', 'Jun 27 2018 03:37:03')
   Arch         : ('64bit', 'WindowsPE')
   ------------Pip Info-----------
   Version      : 19.1.1
   Directory    : d:\program files\python36\lib\site-packages\pip
   ----------MXNet Info-----------
   Version      : 1.4.1
   Directory    : d:\program files\python36\lib\site-packages\mxnet
   Commit hash file "d:\program files\python36\lib\site-packages\mxnet\COMMIT_HASH" not found. Not installed from pre-built package or built from source.
   Library      : ['d:\\program files\\python36\\lib\\site-packages\\mxnet\\libmxnet.dll']
   Build features:
   No runtime build feature info available
   ----------System Info----------
   Platform     : Windows-10-10.0.17758-SP0
   system       : Windows
   node         : Neutron
   release      : 10
   version      : 10.0.17758
   ----------Hardware Info----------
   machine      : AMD64
   processor    : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
   Name
   Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
   
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0249 sec, LOAD: 2.2043 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5834 sec, LOAD: 1.2048 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0020 sec, LOAD: 0.9903 sec.
   Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:841)>, DNS finished in 0.1266636848449707 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.2234 sec, LOAD: 4.6954 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.3620 sec, LOAD: 0.7789 sec.
   ```
   
   ## Minimum reproducible example
   ## Steps to reproduce
   ```
   from time import time

   import mxnet as mx

   def data_xform(data):
       """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
       return mx.ndarray.moveaxis(data, 2, 0).astype('float32') / 255

   train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
   train_loader = mx.gluon.data.DataLoader(train_data, shuffle=False, batch_size=10000)
   a = iter(train_loader)
   # Time how long it takes to fetch the first batch of 10000 samples.
   t = time()
   _ = next(a)
   print(time() - t)
   ```
   A single `next(a)` took `3.6711745262145996` seconds, which is roughly 20 s for all 6 `next(a)` calls.
   `mx.io.NDArrayIter`, by contrast, finishes iterating almost immediately.
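
To compare the two loaders on equal terms, it may help to time the first batch and the full pass with the same helper. The sketch below is only an illustration: `time_batches` is a hypothetical name, and the list of fake batches stands in for a real `DataLoader` or `NDArrayIter`, both of which are iterable.

```python
from time import perf_counter


def time_batches(batches):
    """Return (seconds for the first batch, seconds for the full pass, batch count).

    `batches` is any iterable, e.g. a Gluon DataLoader or an NDArrayIter.
    """
    it = iter(batches)
    start = perf_counter()
    next(it)  # first batch; raises StopIteration if `batches` is empty
    first = perf_counter() - start
    count = 1
    for _ in it:  # drain the remaining batches
        count += 1
    total = perf_counter() - start
    return first, total, count


# Stand-in data source: 6 batches, mirroring 60000 samples at batch_size=10000.
fake_batches = [list(range(10)) for _ in range(6)]
first, total, count = time_batches(fake_batches)
print(f"first batch: {first:.6f}s, full pass: {total:.6f}s, batches: {count}")
```

Because MXNet executes operations asynchronously, a fair measurement on real data may also need a synchronization call such as `mx.nd.waitall()` before the clock is read.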
   
   ## What have you tried to solve it?
   I found that the problem occurs [here](https://github.com/apache/incubator-mxnet/blob/8158ba4b0f1ebd696ec09a0b1aa6031bacb60740/python/mxnet/gluon/data/dataloader.py#L371), but I could not fix it.
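
One plausible reading, which this report does not confirm, is that the cost comes from assembling each batch sample by sample: a dataset wrapped with `transform_first` applies the transform to every item as it is fetched, so a batch of 10000 means 10000 transform calls. The pure-Python sketch below only illustrates that difference in call counts. It is not MXNet code, and every name in it (`transform`, `per_sample_batch`, `sliced_batch`) is hypothetical.

```python
calls = {"transform": 0}


def transform(sample):
    """Stand-in for data_xform: scale one sample to [0, 1]."""
    calls["transform"] += 1
    return [v / 255 for v in sample]


def per_sample_batch(dataset, indices):
    """Fetch and transform every sample individually, then collect them."""
    return [transform(dataset[i]) for i in indices]


def sliced_batch(scaled_dataset, start, stop):
    """Take one contiguous slice of data that was transformed ahead of time."""
    return scaled_dataset[start:stop]


dataset = [[i % 256] * 4 for i in range(1000)]  # 1000 tiny fake samples
batch_size = 500

batch_a = per_sample_batch(dataset, range(batch_size))
per_batch_calls = calls["transform"]  # one transform call per sample

scaled = [[v / 255 for v in s] for s in dataset]  # one-off preprocessing
batch_b = sliced_batch(scaled, 0, batch_size)

print(per_batch_calls, batch_a == batch_b)
```

Both paths produce the same batch, but the first pays a per-sample cost on every batch while the second pays for preprocessing once. If that is indeed the bottleneck, transforming the whole array up front would be one workaround to try.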
   
   What's more, after changing `batch_size` to `10000`, the NDArray API fails to optimize the [MNIST model](https://mxnet.incubator.apache.org/versions/master/tutorials/python/mnist.html): the training accuracy stays at about 11% for all 10 epochs.
   ```
   ......
   >>> batch_size = 10000
   >>> train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
   >>> val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
   ......
   >>> import logging
   >>> logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout
   >>> # create a trainable module on compute context
   ... mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
   >>> mlp_model.fit(train_iter,  # train data
   ...               eval_data=val_iter,  # validation data
   ...               optimizer='sgd',  # use SGD to train
   ...               optimizer_params={'learning_rate':0.1},  # use fixed learning rate
   ...               eval_metric='acc',  # report accuracy during training
   ...               batch_end_callback = mx.callback.Speedometer(batch_size, 100), # output progress for each 100 data batches
   ...               num_epoch=10)  # train for at most 10 dataset passes
   INFO:root:Epoch[0] Train-accuracy=0.107000
   INFO:root:Epoch[0] Time cost=0.123
   INFO:root:Epoch[0] Validation-accuracy=0.113500
   INFO:root:Epoch[1] Train-accuracy=0.112367
   INFO:root:Epoch[1] Time cost=0.183
   INFO:root:Epoch[1] Validation-accuracy=0.113500
   INFO:root:Epoch[2] Train-accuracy=0.112367
   INFO:root:Epoch[2] Time cost=0.158
   INFO:root:Epoch[2] Validation-accuracy=0.113500
   INFO:root:Epoch[3] Train-accuracy=0.112367
   INFO:root:Epoch[3] Time cost=0.504
   INFO:root:Epoch[3] Validation-accuracy=0.113500
   INFO:root:Epoch[4] Train-accuracy=0.112367
   INFO:root:Epoch[4] Time cost=0.142
   INFO:root:Epoch[4] Validation-accuracy=0.113500
   INFO:root:Epoch[5] Train-accuracy=0.112367
   INFO:root:Epoch[5] Time cost=0.164
   INFO:root:Epoch[5] Validation-accuracy=0.113500
   INFO:root:Epoch[6] Train-accuracy=0.112367
   INFO:root:Epoch[6] Time cost=0.471
   INFO:root:Epoch[6] Validation-accuracy=0.113500
   INFO:root:Epoch[7] Train-accuracy=0.112367
   INFO:root:Epoch[7] Time cost=0.167
   INFO:root:Epoch[7] Validation-accuracy=0.113500
   INFO:root:Epoch[8] Train-accuracy=0.112367
   INFO:root:Epoch[8] Time cost=0.241
   INFO:root:Epoch[8] Validation-accuracy=0.113500
   INFO:root:Epoch[9] Train-accuracy=0.112367
   INFO:root:Epoch[9] Time cost=0.307
   INFO:root:Epoch[9] Validation-accuracy=0.113500
   ```
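
A possible explanation for the flat accuracy, offered here as an assumption and not a confirmed diagnosis, is that the larger batch leaves very few parameter updates. With 60000 training samples and `batch_size = 10000`, each epoch has only 6 SGD steps. The sketch below works through that arithmetic; the comparison value of 100 is assumed to be the tutorial's original batch size.

```python
import math

num_train = 60000   # MNIST training-set size
num_epoch = 10      # as passed to fit() above
frequent = 100      # second argument of Speedometer above


def updates_per_epoch(num_samples, batch_size):
    """Number of batches (and thus SGD updates) in one pass over the data."""
    return math.ceil(num_samples / batch_size)


for batch_size in (100, 10000):  # 100 is assumed to be the tutorial's value
    per_epoch = updates_per_epoch(num_train, batch_size)
    total = per_epoch * num_epoch
    # Speedometer reports every `frequent` batches, so short epochs never trigger it.
    speedometer_can_log = per_epoch > frequent
    print(batch_size, per_epoch, total, speedometer_can_log)
```

This would also be consistent with the log containing no Speedometer lines: an epoch of 6 batches never reaches the 100-batch reporting interval. If the update count is the cause, training for more epochs or adjusting the learning rate would be the obvious things to try.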


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
