KeyKy opened a new issue #7448: out of memory when training ImageNet with a .rec file.
URL: https://github.com/apache/incubator-mxnet/issues/7448

## Environment info

- Operating System: Ubuntu 16.04
- Compiler: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
- Package used (Python/R/Scala/Julia): Python
- MXNet commit hash: 8ad3c8a7a98dfa6bd6f5065cf9c3688f2414c3d4
- Python version and distribution: Python 2.7.12

## Error Message

Memory usage at the beginning of training is 5.7%–7.9%. After a while (3–4 hours) it reaches 27%, and eventually training runs out of memory entirely (I was woken up by the alarm message).

## Steps to reproduce

(Or, if you are running standard examples, the commands that lead to the error.)

1. `cd examples/image-classification && python train_imagenet.py --network my_net --gpus 0,1,2,3 --num-epochs 100 --lr 0.01 --lr-step-epochs 30,60,80,110 --batch-size 256 --top-k 5 --data-train /data_shared/datasets/ILSVRC2015/rec/train_480_q100.rec --data-val /data_shared/datasets/ILSVRC2015/rec/val_480_q100.rec --rgb-mean 123.68,116.779,103.939 --data-nthreads 4 --model-prefix ./my_net`

## What have you tried to solve it?

1. Set `prefetch_buffer = 1`, but the accuracy of my model dropped to 20%; with `prefetch_buffer` set back to 2, 4, or 8, the accuracy is correct again.
2. CPU memory still increases continually even with `prefetch_buffer = 1`.
3. Found some similar issues:
   - https://github.com/apache/incubator-mxnet/issues/1411
   - https://github.com/apache/incubator-mxnet/issues/3183
   - https://github.com/apache/incubator-mxnet/issues/2969
   - https://github.com/apache/incubator-mxnet/issues/2111
   - https://github.com/apache/incubator-mxnet/issues/2099
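To make the memory growth easier to track than watching a percentage, one could log the process's peak resident set size at regular intervals during training. A minimal standard-library sketch (the `log_memory` helper and the batch-loop placement are hypothetical, not part of `train_imagenet.py`; `resource` is POSIX-only and `ru_maxrss` is reported in kilobytes on Linux):

```python
import resource

def rss_mb():
    # Peak resident set size of this process.
    # On Linux, ru_maxrss is in kilobytes; convert to megabytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def log_memory(batch_idx, every=100):
    # Print peak RSS every `every` batches; call this inside the
    # training loop to see whether memory grows monotonically.
    if batch_idx % every == 0:
        print("batch %d: peak RSS %.1f MB" % (batch_idx, rss_mb()))
```

Calling `log_memory(i)` once per batch would show whether the growth correlates with the number of batches consumed from the `.rec` iterator, which would help narrow the leak down to the data pipeline versus the training code.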