Neutron3529 opened a new issue #15655: Performance regression for gluon dataloader with large batch size
URL: https://github.com/apache/incubator-mxnet/issues/15655

## Description
gluon's `DataLoader` is far slower than `mx.io.NDArrayIter` when the batch size is large.

## Environment info (Required)

```
----------Python Info----------
Version : 3.6.6
Compiler : MSC v.1900 64 bit (AMD64)
Build : ('v3.6.6:4cf1f54eb7', 'Jun 27 2018 03:37:03')
Arch : ('64bit', 'WindowsPE')
------------Pip Info-----------
Version : 19.1.1
Directory : d:\program files\python36\lib\site-packages\pip
----------MXNet Info-----------
Version : 1.4.1
Directory : d:\program files\python36\lib\site-packages\mxnet
Commit hash file "d:\program files\python36\lib\site-packages\mxnet\COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library : ['d:\\program files\\python36\\lib\\site-packages\\mxnet\\libmxnet.dll']
Build features:
No runtime build feature info available
----------System Info----------
Platform : Windows-10-10.0.17758-SP0
system : Windows
node : Neutron
release : 10
version : 10.0.17758
----------Hardware Info----------
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Name
Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0249 sec, LOAD: 2.2043 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5834 sec, LOAD: 1.2048 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0020 sec, LOAD: 0.9903 sec.
Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:841)>, DNS finished in 0.1266636848449707 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.2234 sec, LOAD: 4.6954 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.3620 sec, LOAD: 0.7789 sec.
```

## Minimum reproducible example

## Steps to reproduce

```
from time import time

import mxnet as mx

def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
    return mx.ndarray.moveaxis(data, 2, 0).astype('float32') / 255

train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
train_loader = mx.gluon.data.DataLoader(train_data, shuffle=False, batch_size=10000)

# Time how long it takes to produce the first batch.
a = iter(train_loader)
t = time()
_ = next(a)
print(time() - t)
```

A single `next(a)` took `3.6711745262145996` seconds, and all 6 `next(a)` calls took roughly 20 s in total.
With `mx.io.NDArrayIter`, the iteration finishes almost immediately.

## What have you tried to solve it?

I traced the problem to [this line](https://github.com/apache/incubator-mxnet/blob/8158ba4b0f1ebd696ec09a0b1aa6031bacb60740/python/mxnet/gluon/data/dataloader.py#L371), but I could not fix it.

What's more, after changing `batch_size` to `10000`, the NDArray API fails to optimize the [MNIST model](https://mxnet.incubator.apache.org/versions/master/tutorials/python/mnist.html); the accuracy stays at about 0.11 for all 10 epochs:

```
......
>>> batch_size = 10000
>>> train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
>>> val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
......
>>> import logging
>>> logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout
>>> # create a trainable module on compute context
... mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
>>> mlp_model.fit(train_iter,  # train data
...               eval_data=val_iter,  # validation data
...               optimizer='sgd',  # use SGD to train
...               optimizer_params={'learning_rate':0.1},  # use fixed learning rate
...               eval_metric='acc',  # report accuracy during training
...               batch_end_callback = mx.callback.Speedometer(batch_size, 100), # output progress for each 100 data batches
...               num_epoch=10)  # train for at most 10 dataset passes
INFO:root:Epoch[0] Train-accuracy=0.107000
INFO:root:Epoch[0] Time cost=0.123
INFO:root:Epoch[0] Validation-accuracy=0.113500
INFO:root:Epoch[1] Train-accuracy=0.112367
INFO:root:Epoch[1] Time cost=0.183
INFO:root:Epoch[1] Validation-accuracy=0.113500
INFO:root:Epoch[2] Train-accuracy=0.112367
INFO:root:Epoch[2] Time cost=0.158
INFO:root:Epoch[2] Validation-accuracy=0.113500
INFO:root:Epoch[3] Train-accuracy=0.112367
INFO:root:Epoch[3] Time cost=0.504
INFO:root:Epoch[3] Validation-accuracy=0.113500
INFO:root:Epoch[4] Train-accuracy=0.112367
INFO:root:Epoch[4] Time cost=0.142
INFO:root:Epoch[4] Validation-accuracy=0.113500
INFO:root:Epoch[5] Train-accuracy=0.112367
INFO:root:Epoch[5] Time cost=0.164
INFO:root:Epoch[5] Validation-accuracy=0.113500
INFO:root:Epoch[6] Train-accuracy=0.112367
INFO:root:Epoch[6] Time cost=0.471
INFO:root:Epoch[6] Validation-accuracy=0.113500
INFO:root:Epoch[7] Train-accuracy=0.112367
INFO:root:Epoch[7] Time cost=0.167
INFO:root:Epoch[7] Validation-accuracy=0.113500
INFO:root:Epoch[8] Train-accuracy=0.112367
INFO:root:Epoch[8] Time cost=0.241
INFO:root:Epoch[8] Validation-accuracy=0.113500
INFO:root:Epoch[9] Train-accuracy=0.112367
INFO:root:Epoch[9] Time cost=0.307
INFO:root:Epoch[9] Validation-accuracy=0.113500
```
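
To compare the two iterators on equal terms, each `next()` call can be timed separately rather than only the first one. The following is a minimal sketch and not part of MXNet: `time_batches` is a hypothetical helper name, and the generator at the bottom is only a stand-in for a real loader. It relies solely on the Python iterator protocol, which both `DataLoader` and `mx.io.NDArrayIter` support.

```python
from time import perf_counter


def time_batches(iterable, max_batches=None):
    """Return the wall-clock seconds spent producing each batch.

    Hypothetical helper (not an MXNet API). Works with any Python iterable,
    so the same function can wrap a gluon DataLoader or an mx.io.NDArrayIter.
    """
    durations = []
    iterator = iter(iterable)
    while max_batches is None or len(durations) < max_batches:
        start = perf_counter()
        try:
            next(iterator)  # fetch one batch and discard it
        except StopIteration:
            break  # iterator exhausted; the failed fetch is not recorded
        durations.append(perf_counter() - start)
    return durations


# Stand-in for a real loader: 60000 samples / batch size 10000 = 6 batches.
fake_loader = (list(range(10)) for _ in range(60000 // 10000))
per_batch = time_batches(fake_loader)
print("batches:", len(per_batch))
print("total seconds:", sum(per_batch))
```

With MXNet installed, passing `train_loader` or `train_iter` from the snippets above in place of `fake_loader` should give per-batch timings for each. Since MXNet runs NDArray operations asynchronously, a call to `mx.nd.waitall()` after each batch may be needed for the numbers to include the actual computation; that step is left out of this sketch.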
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
