zhreshold commented on issue #15655: Performance regression for gluon dataloader with large batch size
URL: https://github.com/apache/incubator-mxnet/issues/15655#issuecomment-515703642

@Neutron3529 Unfortunately, DataLoader has to allocate additional memory as you iterate through the dataset, and it uses the mx.nd.stack operator to batch images, which means the MXNet engine takes control. In comparison, NDArrayIter or pure numpy array iteration doesn't incur this additional overhead.

This problem is most visible for small workloads such as MNIST. For large network training, however, several seconds is negligible compared with the per-epoch training or validation time (minutes).

In fact, if you have a multi-core CPU, you can speed up the process by setting num_workers in this case:

```python
import time

import mxnet as mx
from mxnet import nd


def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
    return nd.moveaxis(data, 2, 0).astype('float32') / 255


def bench_time(num_workers=0):
    print('-----\nnum_workers:', num_workers)
    # t1: time to load the datasets and attach the transform.
    tic = time.time()
    train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
    val_data = mx.gluon.data.vision.MNIST(train=False).transform_first(data_xform)
    t1 = time.time() - tic

    # t2: time to build the loaders and iterate once over the training set.
    tic = time.time()
    batch_size = 100  # setting this to 10000 produces the same result
    train_loader = mx.gluon.data.DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=num_workers)
    val_loader = mx.gluon.data.DataLoader(val_data, shuffle=False, batch_size=batch_size, num_workers=num_workers)
    for i, j in train_loader:
        pass
    t2 = time.time() - tic
    print('t1', t1, 't2', t2)


if __name__ == '__main__':
    bench_time(0)
    bench_time(4)
```

```bash
-----
num_workers: 0
t1 0.35317301750183105 t2 8.19723916053772
-----
num_workers: 4
t1 0.2771739959716797 t2 3.3613219261169434
```
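
To put the two t2 timings in perspective, here is a minimal sketch that turns them into a speedup factor and an approximate per-batch cost. The helper names `speedup` and `ms_per_batch` are hypothetical, and the sample count assumes the standard 60,000-image MNIST training split; only the two timings and the batch size come from the run above.

```python
# Timings copied from the benchmark output (seconds for one pass over the training set).
T2_SINGLE = 8.19723916053772    # num_workers=0
T2_MULTI = 3.3613219261169434   # num_workers=4

# Assumption: the standard MNIST training split has 60,000 samples.
NUM_SAMPLES = 60000
BATCH_SIZE = 100  # batch size used in the benchmark


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


def ms_per_batch(total_seconds, num_samples=NUM_SAMPLES, batch_size=BATCH_SIZE):
    """Return the average wall-clock time per batch in milliseconds."""
    num_batches = -(-num_samples // batch_size)  # ceiling division
    return 1000.0 * total_seconds / num_batches


if __name__ == "__main__":
    print("speedup with 4 workers: %.2fx" % speedup(T2_SINGLE, T2_MULTI))
    print("ms per batch, num_workers=0: %.2f" % ms_per_batch(T2_SINGLE))
    print("ms per batch, num_workers=4: %.2f" % ms_per_batch(T2_MULTI))
```

Under these assumptions, four workers give a bit more than a 2x speedup, and the single-process loader spends on the order of 10 ms per batch. That cost is noticeable for MNIST but small next to the compute time of a large network.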
