roywei opened a new issue #15430: Performance regression on training resnet152 with CIFAR10 on CPU
URL: https://github.com/apache/incubator-mxnet/issues/15430
 
 
   Follow-up to the dev list discussion:
https://lists.apache.org/thread.html/154ef1e4010671e7375c7a7cbedb413d5a4a3677321488440fb32a3a@%3Cdev.mxnet.apache.org%3E
   
   We have found that resnet152 shows a performance regression when training
   on the CIFAR10 dataset on CPU (C5.18xlarge).
   
   To summarize the findings:
   
   Scripts/Model: 
https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py
 
   20 epochs in total; the first 10 epochs are warm-up.
   
   With MXNet 1.4.1, the average time is 164.23 s.
   With MXNet 1.5.0, the average time is 174.59 s (~6.3% regression).
   (1.5.0 version: pip install mxnet-mkl==1.5.0b20190619, which
   corresponds to commit ccbbf6b4b76ea536a6583c99497c83b65a20817b, which is
   4 commits behind the 1.5.x branch.)
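
   For readers reproducing the measurement, here is a minimal sketch of how
   an average that leaves out the warm-up epochs could be computed. It
   assumes the reported "average time" is the mean per-epoch time after the
   first 10 epochs, which the issue does not state explicitly. The helper
   name and the sample timings are hypothetical, not taken from the
   benchmark script or from the measured data.

```python
from statistics import mean


def average_after_warmup(epoch_times, warmup_epochs=10):
    """Mean epoch time in seconds, ignoring the first `warmup_epochs` entries.

    Hypothetical helper: assumes one wall-clock time per epoch.
    """
    if len(epoch_times) <= warmup_epochs:
        raise ValueError("need at least one epoch after warm-up")
    return mean(epoch_times[warmup_epochs:])


# Made-up per-epoch timings (seconds), NOT the data behind this issue:
# 10 slow warm-up epochs followed by 10 measured epochs.
example_times = [200.0] * 10 + [
    165.0, 163.0, 164.0, 166.0, 162.0,
    164.0, 165.0, 163.0, 164.0, 164.0,
]

avg = average_after_warmup(example_times)
print(f"average over measured epochs: {avg:.2f} s")
```

   Dropping the warm-up epochs keeps one-time costs such as data loading and
   memory allocation out of the comparison between versions.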
   
   With 50 epochs in total, the first 10 epochs as warm-up, and a fixed seed:
   1.4.1: 164.95 s
   1.5.0: 170.44 s
   This is about a 3% regression.
   Detailed data at [1].
   (1.5.0 version: 1.5.0rc2 release candidate, built from source with MKLDNN)
   
   Gluon Resnet Model:
   Gluon speed test benchmark script -
   
https://github.com/apache/incubator-mxnet/blob/master/benchmark/python/gluon/benchmark_gluon.py
   using the following command:
   python3 benchmark_gluon.py --model 'resnet152_v2' --batch-size 128 --num-batches 200 --type 'training'
   
   I got the following speeds:
   With MXNet 1.4.1, average speed is 25.677534 img/s
   With MXNet 1.5.0, average speed is 25.082130 img/s (~2.3% regression)
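
   The regression percentages quoted in this issue can be checked with a
   short calculation. This is only a sketch: the function names are
   illustrative, and it assumes a regression means a longer time for the
   epoch-time figures and a lower img/s for the throughput figures.

```python
def time_regression_pct(old_seconds, new_seconds):
    """Percent slowdown when the metric is elapsed time (higher is worse)."""
    return (new_seconds - old_seconds) / old_seconds * 100.0


def throughput_regression_pct(old_rate, new_rate):
    """Percent slowdown when the metric is throughput (lower is worse)."""
    return (old_rate - new_rate) / old_rate * 100.0


# Figures reported in this issue (MXNet 1.4.1 vs 1.5.0).
epochs_20 = time_regression_pct(164.23, 174.59)
epochs_50 = time_regression_pct(164.95, 170.44)
gluon = throughput_regression_pct(25.677534, 25.082130)

print(f"20 epochs: {epochs_20:.1f}%")   # 6.3%
print(f"50 epochs: {epochs_50:.1f}%")   # 3.3%
print(f"gluon:     {gluon:.1f}%")       # 2.3%
```

   The sign convention differs between the two helpers so that a positive
   result always means 1.5.0 is slower than 1.4.1.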
   
   
