dimon777 opened a new issue #9692: cifar10 training on P100 seems not converging
URL: https://github.com/apache/incubator-mxnet/issues/9692
 
 
   ## Description
   Training CIFAR-10 on two P100 GPUs with default settings does not seem to converge.
   
   ## Environment info (Required)
   ```
   # python diagnose.py 
   ----------Python Info----------
   ('Version      :', '2.7.12')
   ('Compiler     :', 'GCC 5.4.0 20160609')
   ('Build        :', ('default', 'Dec  4 2017 14:50:18'))
   ('Arch         :', ('64bit', 'ELF'))
   ------------Pip Info-----------
   ('Version      :', '9.0.1')
   ('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
   ----------MXNet Info-----------
   /usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: 
Conversion of the second argument of issubdtype from `float` to `np.floating` 
is deprecated. In future, it will be treated as `np.float64 == 
np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   ('Version      :', '1.0.0')
   ('Directory    :', '/usr/local/lib/python2.7/dist-packages/mxnet')
   ('Commit Hash   :', '25720d0e3c29232a37e2650f3ba3a2454f9367bb')
   ----------System Info----------
   ('Platform     :', 'Linux-4.13.0-1008-gcp-x86_64-with-Ubuntu-16.04-xenial')
   ('system       :', 'Linux')
   ('node         :', 'test-gpu01')
   ('release      :', '4.13.0-1008-gcp')
   ('version      :', '#11-Ubuntu SMP Thu Jan 25 11:08:44 UTC 2018')
   ----------Hardware Info----------
   ('machine      :', 'x86_64')
   ('processor    :', 'x86_64')
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                4
   On-line CPU(s) list:   0-3
   Thread(s) per core:    2
   Core(s) per socket:    2
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
   Stepping:              0
   CPU MHz:               2200.000
   BogoMIPS:              4400.00
   Hypervisor vendor:     KVM
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              56320K
   NUMA node0 CPU(s):     0-3
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma 
cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand 
hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust 
bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0107 
sec, LOAD: 0.6854 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0087 sec, LOAD: 
0.2399 sec.
   Timing for FashionMNIST: 
https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz,
 DNS: 0.0822 sec, LOAD: 0.4332 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0203 sec, 
LOAD: 0.0873 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0880 sec, LOAD: 
0.0706 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0905 sec, LOAD: 
0.5254 sec.
   ```
   
   Package used (Python/R/Scala/Julia):
   Python; MXNet installed via pip.
   
   ## Steps to reproduce
   ```
   $ time python train_cifar10.py --gpus 0,1
   /usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: 
Conversion of the second argument of issubdtype from `float` to `np.floating` 
is deprecated. In future, it will be treated as `np.float64 == 
np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection 
(1): data.mxnet.io
   DEBUG:requests.packages.urllib3.connectionpool:"GET 
/data/cifar10/cifar10_val.rec HTTP/1.1" 200 32040000
   INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection 
(1): data.mxnet.io
   DEBUG:requests.packages.urllib3.connectionpool:"GET 
/data/cifar10/cifar10_train.rec HTTP/1.1" 200 160200000
   INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, 
data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', 
data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, 
dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0,1', 
image_shape='3,28,28', initializer='default', kv_store='device', 
load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', 
macrobatch_size=0, max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, 
max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, 
max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, 
monitor=0, network='resnet', num_classes=10, num_epochs=300, 
num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, 
random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, 
warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
   [23:13:23] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: 
data/cifar10_train.rec, use 1 threads for decoding..
   [23:13:26] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: 
data/cifar10_val.rec, use 1 threads for decoding..
   [23:13:27] src/operator/././cudnn_algoreg-inl.h:107: Running performance 
tests to find the best convolution algorithm, this can take a while... (setting 
env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [23:13:29] src/kvstore/././comm.h:653: only 0 out of 2 GPU pairs are enabled 
direct access. It may affect the performance. You can set 
MXNET_ENABLE_GPU_P2P=0 to turn it off
   [23:13:29] src/kvstore/././comm.h:662: ..
   [23:13:29] src/kvstore/././comm.h:662: ..
   INFO:root:Epoch[0] Batch [20]        Speed: 1422.52 samples/sec      
accuracy=0.138021
   INFO:root:Epoch[0] Batch [40]        Speed: 1364.23 samples/sec      
accuracy=0.188672
   INFO:root:Epoch[0] Batch [60]        Speed: 1376.49 samples/sec      
accuracy=0.207031
   INFO:root:Epoch[0] Batch [80]        Speed: 1330.66 samples/sec      
accuracy=0.224219
   INFO:root:Epoch[0] Batch [100]       Speed: 1335.36 samples/sec      
accuracy=0.230469
   INFO:root:Epoch[0] Batch [120]       Speed: 1379.96 samples/sec      
accuracy=0.252734
   INFO:root:Epoch[0] Batch [140]       Speed: 1367.87 samples/sec      
accuracy=0.276953
   INFO:root:Epoch[0] Batch [160]       Speed: 1340.25 samples/sec      
accuracy=0.285547
   INFO:root:Epoch[0] Batch [180]       Speed: 1369.41 samples/sec      
accuracy=0.271484
   INFO:root:Epoch[0] Batch [200]       Speed: 1370.23 samples/sec      
accuracy=0.292578
   INFO:root:Epoch[0] Batch [220]       Speed: 1339.09 samples/sec      
accuracy=0.311328
   INFO:root:Epoch[0] Batch [240]       Speed: 1384.28 samples/sec      
accuracy=0.292969
   INFO:root:Epoch[0] Batch [260]       Speed: 1354.43 samples/sec      
accuracy=0.302344
   INFO:root:Epoch[0] Batch [280]       Speed: 1364.57 samples/sec      
accuracy=0.318750
   INFO:root:Epoch[0] Batch [300]       Speed: 1377.05 samples/sec      
accuracy=0.339453
   INFO:root:Epoch[0] Batch [320]       Speed: 1392.14 samples/sec      
accuracy=0.353125
   INFO:root:Epoch[0] Batch [340]       Speed: 1321.70 samples/sec      
accuracy=0.359375
   INFO:root:Epoch[0] Batch [360]       Speed: 1356.89 samples/sec      
accuracy=0.360938
   INFO:root:Epoch[0] Batch [380]       Speed: 1334.74 samples/sec      
accuracy=0.375781
   INFO:root:Epoch[0] Train-accuracy=0.365625
   INFO:root:Epoch[0] Time cost=37.347
   INFO:root:Epoch[0] Validation-accuracy=0.392009
   INFO:root:Epoch[1] Batch [20]        Speed: 1363.96 samples/sec      
accuracy=0.385417
   INFO:root:Epoch[1] Batch [40]        Speed: 1379.70 samples/sec      
accuracy=0.385937
   INFO:root:Epoch[1] Batch [60]        Speed: 1322.08 samples/sec      
accuracy=0.389062
   INFO:root:Epoch[1] Batch [80]        Speed: 1355.44 samples/sec      
accuracy=0.391016
   INFO:root:Epoch[1] Batch [100]       Speed: 1333.35 samples/sec      
accuracy=0.423047
   INFO:root:Epoch[1] Batch [120]       Speed: 1335.11 samples/sec      
accuracy=0.418750
   INFO:root:Epoch[1] Batch [140]       Speed: 1376.00 samples/sec      
accuracy=0.426953
   INFO:root:Epoch[1] Batch [160]       Speed: 1376.29 samples/sec      
accuracy=0.458984
   INFO:root:Epoch[1] Batch [180]       Speed: 1359.63 samples/sec      
accuracy=0.462891
   INFO:root:Epoch[1] Batch [200]       Speed: 1315.08 samples/sec      
accuracy=0.465234
   INFO:root:Epoch[1] Batch [220]       Speed: 1375.77 samples/sec      
accuracy=0.455078
   INFO:root:Epoch[1] Batch [240]       Speed: 1349.76 samples/sec      
accuracy=0.465625
   INFO:root:Epoch[1] Batch [260]       Speed: 1349.80 samples/sec      
accuracy=0.491406
   INFO:root:Epoch[1] Batch [280]       Speed: 1361.70 samples/sec      
accuracy=0.497266
   INFO:root:Epoch[1] Batch [300]       Speed: 1374.36 samples/sec      
accuracy=0.511719
   INFO:root:Epoch[1] Batch [320]       Speed: 1354.98 samples/sec      
accuracy=0.530469
   INFO:root:Epoch[1] Batch [340]       Speed: 1366.00 samples/sec      
accuracy=0.532031
   INFO:root:Epoch[1] Batch [360]       Speed: 1373.03 samples/sec      
accuracy=0.546484
   INFO:root:Epoch[1] Batch [380]       Speed: 1317.23 samples/sec      
accuracy=0.532422
   INFO:root:Epoch[1] Train-accuracy=0.522656
   INFO:root:Epoch[1] Time cost=36.902
   INFO:root:Epoch[1] Validation-accuracy=0.538862
   ```
   ## What have you tried to solve it?
   Tried running with a single GPU; the same issue occurs.
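   For context, the argument dump above shows `lr=0.05`, `lr_factor=0.1`, and `lr_step_epochs='200,250'` over `num_epochs=300`. A minimal sketch of the step schedule those arguments imply (ignoring the 5-epoch linear warmup, and not using any MXNet API) shows the learning rate stays at 0.05 for the first 200 epochs, so validation accuracy rising from ~0.39 to ~0.54 in the first two epochs may simply be slow early progress under that schedule rather than divergence:
   
   ```python
   def lr_at_epoch(epoch, base_lr=0.05, factor=0.1, step_epochs=(200, 250)):
       """Learning rate implied by the Namespace dump above
       (lr=0.05, lr_factor=0.1, lr_step_epochs='200,250').
       Warmup (warmup_epochs=5) is deliberately ignored in this sketch."""
       lr = base_lr
       for step in step_epochs:
           if epoch >= step:
               lr *= factor  # drop by lr_factor at each step epoch
       return lr
   
   # The rate only drops late in training:
   print(lr_at_epoch(1))    # 0.05   (epochs 0-199)
   print(lr_at_epoch(200))  # 0.005  (epochs 200-249)
   print(lr_at_epoch(250))  # 0.0005 (epochs 250-299)
   ```
   
   Judging convergence after only two of 300 epochs may therefore be premature with these defaults.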
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 