chrishkchris commented on a change in pull request #468: Distributed module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821
 
 

 ##########
 File path: src/api/config.i
 ##########
 @@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1
 
 Review comment:
   In addition to the above, I also ran a multi-GPU training and evaluation test with ResNet-50 on the CIFAR-10 dataset using 8 K80 GPUs. Over 100 epochs it reduced the training loss from 3983.8 to 35.56, and the evaluation accuracy reached a maximum of 90.6% at epoch 90. Note, however, that this run does not synchronize the batch-norm running mean and variance across the ranks before the evaluation phase:
   ```
   Epoch=0: 100%|██████████| 195/195 [06:06<00:00,  1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260
   Test accuracy = 0.347556
   Epoch=1: 100%|██████████| 195/195 [06:17<00:00,  1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768
   Test accuracy = 0.437700
   Epoch=2: 100%|██████████| 195/195 [06:12<00:00,  1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558
   Test accuracy = 0.459936
   Epoch=3: 100%|██████████| 195/195 [06:13<00:00,  1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348
   Test accuracy = 0.548978
   Epoch=4: 100%|██████████| 195/195 [06:19<00:00,  1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847
   Test accuracy = 0.594451
   Epoch=5: 100%|██████████| 195/195 [06:13<00:00,  1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911
   Test accuracy = 0.633413
   Epoch=6: 100%|██████████| 195/195 [06:10<00:00,  1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753
   Test accuracy = 0.659054
   Epoch=7: 100%|██████████| 195/195 [06:14<00:00,  1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220
   Test accuracy = 0.709836
   Epoch=8: 100%|██████████| 195/195 [06:20<00:00,  1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523
   Test accuracy = 0.735477
   Epoch=9: 100%|██████████| 195/195 [06:15<00:00,  1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054
   Test accuracy = 0.745393
   Epoch=10: 100%|██████████| 195/195 [06:11<00:00,  1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353
   Test accuracy = 0.750501
   Epoch=11: 100%|██████████| 195/195 [06:10<00:00,  1.89s/it]Training loss = 956.600037, training accuracy = 0.788261
   Test accuracy = 0.777744
   Epoch=12: 100%|██████████| 195/195 [06:16<00:00,  1.92s/it]Training loss = 881.050171, training accuracy = 0.804167
   Test accuracy = 0.793369
   Epoch=13: 100%|██████████| 195/195 [06:16<00:00,  1.92s/it]Training loss = 828.298828, training accuracy = 0.818309
   Test accuracy = 0.807692
   Epoch=14: 100%|██████████| 195/195 [06:11<00:00,  1.90s/it]Training loss = 790.558838, training accuracy = 0.823918
   Test accuracy = 0.795373
   Epoch=15: 100%|██████████| 195/195 [06:13<00:00,  1.90s/it]Training loss = 740.679871, training accuracy = 0.833734
   Test accuracy = 0.816707
   Epoch=16: 100%|██████████| 195/195 [06:20<00:00,  1.95s/it]Training loss = 691.391479, training accuracy = 0.846855
   Test accuracy = 0.818510
   Epoch=17: 100%|██████████| 195/195 [06:16<00:00,  1.89s/it]Training loss = 657.708130, training accuracy = 0.853986
   Test accuracy = 0.826122
   Epoch=18: 100%|██████████| 195/195 [06:10<00:00,  1.88s/it]Training loss = 627.918579, training accuracy = 0.860216
   Test accuracy = 0.844752
   Epoch=19: 100%|██████████| 195/195 [06:13<00:00,  1.91s/it]Training loss = 592.768982, training accuracy = 0.869551
   Test accuracy = 0.845653
   Epoch=20: 100%|██████████| 195/195 [06:19<00:00,  1.97s/it]Training loss = 561.560608, training accuracy = 0.875060
   Test accuracy = 0.835938
   Epoch=21: 100%|██████████| 195/195 [06:15<00:00,  1.97s/it]Training loss = 533.083740, training accuracy = 0.881370
   Test accuracy = 0.849860
   Epoch=22: 100%|██████████| 195/195 [06:12<00:00,  1.91s/it]Training loss = 508.004578, training accuracy = 0.885056
   Test accuracy = 0.833434
   Epoch=23: 100%|██████████| 195/195 [06:12<00:00,  1.92s/it]Training loss = 477.516602, training accuracy = 0.892488
   Test accuracy = 0.858474
   Epoch=24: 100%|██████████| 195/195 [06:20<00:00,  1.96s/it]Training loss = 455.839996, training accuracy = 0.896595
   Test accuracy = 0.867388
   Epoch=25: 100%|██████████| 195/195 [06:16<00:00,  1.95s/it]Training loss = 434.568390, training accuracy = 0.904327
   Test accuracy = 0.858774
   Epoch=26: 100%|██████████| 195/195 [06:10<00:00,  1.87s/it]Training loss = 414.232391, training accuracy = 0.907071
   Test accuracy = 0.833333
   Epoch=27: 100%|██████████| 195/195 [06:13<00:00,  1.87s/it]Training loss = 400.625458, training accuracy = 0.909275
   Test accuracy = 0.858974
   Epoch=28: 100%|██████████| 195/195 [06:20<00:00,  1.95s/it]Training loss = 378.750885, training accuracy = 0.914443
   Test accuracy = 0.865885
   Epoch=29: 100%|██████████| 195/195 [06:14<00:00,  1.91s/it]Training loss = 369.449249, training accuracy = 0.917548
   Test accuracy = 0.871394
   Epoch=30: 100%|██████████| 195/195 [06:13<00:00,  1.93s/it]Training loss = 345.693939, training accuracy = 0.921935
   Test accuracy = 0.868389
   Epoch=31: 100%|██████████| 195/195 [06:13<00:00,  1.88s/it]Training loss = 333.472717, training accuracy = 0.924860
   Test accuracy = 0.865885
   Epoch=32: 100%|██████████| 195/195 [06:15<00:00,  1.97s/it]Training loss = 316.274231, training accuracy = 0.927244
   Test accuracy = 0.867889
   Epoch=33: 100%|██████████| 195/195 [06:15<00:00,  1.95s/it]Training loss = 300.943665, training accuracy = 0.931871
   Test accuracy = 0.871194
   Epoch=34: 100%|██████████| 195/195 [06:12<00:00,  1.93s/it]Training loss = 299.318787, training accuracy = 0.931270
   Test accuracy = 0.876402
   Epoch=35: 100%|██████████| 195/195 [06:10<00:00,  1.88s/it]Training loss = 285.711884, training accuracy = 0.935317
   Test accuracy = 0.879207
   Epoch=36: 100%|██████████| 195/195 [06:16<00:00,  1.98s/it]Training loss = 266.605042, training accuracy = 0.939844
   Test accuracy = 0.882612
   Epoch=37: 100%|██████████| 195/195 [06:15<00:00,  1.93s/it]Training loss = 253.637848, training accuracy = 0.943069
   Test accuracy = 0.882111
   Epoch=38: 100%|██████████| 195/195 [06:09<00:00,  1.92s/it]Training loss = 243.406281, training accuracy = 0.944832
   Test accuracy = 0.888421
   Epoch=39: 100%|██████████| 195/195 [06:11<00:00,  1.92s/it]Training loss = 236.608551, training accuracy = 0.945553
   Test accuracy = 0.868089
   Epoch=40: 100%|██████████| 195/195 [06:21<00:00,  1.93s/it]Training loss = 226.691986, training accuracy = 0.948798
   Test accuracy = 0.874099
   Epoch=41: 100%|██████████| 195/195 [06:15<00:00,  1.94s/it]Training loss = 210.119171, training accuracy = 0.952724
   Test accuracy = 0.885517
   Epoch=42: 100%|██████████| 195/195 [06:12<00:00,  1.92s/it]Training loss = 200.071671, training accuracy = 0.954687
   Test accuracy = 0.872696
   Epoch=43: 100%|██████████| 195/195 [06:13<00:00,  1.94s/it]Training loss = 201.704514, training accuracy = 0.954527
   Test accuracy = 0.867788
   Epoch=44: 100%|██████████| 195/195 [06:20<00:00,  1.95s/it]Training loss = 197.687622, training accuracy = 0.955469
   Test accuracy = 0.868690
   Epoch=45: 100%|██████████| 195/195 [06:15<00:00,  1.93s/it]Training loss = 176.998566, training accuracy = 0.959675
   Test accuracy = 0.879307
   Epoch=46: 100%|██████████| 195/195 [06:12<00:00,  1.94s/it]Training loss = 169.160126, training accuracy = 0.961478
   Test accuracy = 0.879307
   Epoch=47: 100%|██████████| 195/195 [06:13<00:00,  1.94s/it]Training loss = 166.751923, training accuracy = 0.961438
   Test accuracy = 0.876202
   Epoch=48: 100%|██████████| 195/195 [06:20<00:00,  1.94s/it]Training loss = 163.559586, training accuracy = 0.962460
   Test accuracy = 0.886218
   Epoch=49: 100%|██████████| 195/195 [06:14<00:00,  1.91s/it]Training loss = 157.634018, training accuracy = 0.964483
   Test accuracy = 0.882812
   Epoch=50: 100%|██████████| 195/195 [06:12<00:00,  1.90s/it]Training loss = 142.496307, training accuracy = 0.967869
   Test accuracy = 0.886218
   Epoch=51: 100%|██████████| 195/195 [06:09<00:00,  1.81s/it]Training loss = 140.872879, training accuracy = 0.968169
   Test accuracy = 0.894131
   Epoch=52: 100%|██████████| 195/195 [06:20<00:00,  1.99s/it]Training loss = 142.073883, training accuracy = 0.968189
   Test accuracy = 0.889824
   Epoch=53: 100%|██████████| 195/195 [06:16<00:00,  1.88s/it]Training loss = 138.559738, training accuracy = 0.968329
   Test accuracy = 0.876903
   Epoch=54: 100%|██████████| 195/195 [06:10<00:00,  1.92s/it]Training loss = 132.399109, training accuracy = 0.969752
   Test accuracy = 0.890425
   Epoch=55: 100%|██████████| 195/195 [06:11<00:00,  1.91s/it]Training loss = 123.129364, training accuracy = 0.971755
   Test accuracy = 0.881711
   Epoch=56: 100%|██████████| 195/195 [06:21<00:00,  1.93s/it]Training loss = 121.916557, training accuracy = 0.971995
   Test accuracy = 0.894631
   Epoch=57: 100%|██████████| 195/195 [06:14<00:00,  1.91s/it]Training loss = 111.385445, training accuracy = 0.974860
   Test accuracy = 0.891426
   Epoch=58: 100%|██████████| 195/195 [06:10<00:00,  1.87s/it]Training loss = 117.021904, training accuracy = 0.973938
   Test accuracy = 0.886719
   Epoch=59: 100%|██████████| 195/195 [06:11<00:00,  1.89s/it]Training loss = 100.442093, training accuracy = 0.977264
   Test accuracy = 0.884215
   Epoch=60: 100%|██████████| 195/195 [06:18<00:00,  1.92s/it]Training loss = 103.660690, training accuracy = 0.976342
   Test accuracy = 0.890525
   Epoch=61: 100%|██████████| 195/195 [06:15<00:00,  1.93s/it]Training loss = 106.059982, training accuracy = 0.975861
   Test accuracy = 0.897236
   Epoch=62: 100%|██████████| 195/195 [06:10<00:00,  1.89s/it]Training loss = 100.289398, training accuracy = 0.977604
   Test accuracy = 0.887921
   Epoch=63: 100%|██████████| 195/195 [06:12<00:00,  1.91s/it]Training loss = 93.661957, training accuracy = 0.978906
   Test accuracy = 0.880108
   Epoch=64: 100%|██████████| 195/195 [06:19<00:00,  1.92s/it]Training loss = 88.674843, training accuracy = 0.980048
   Test accuracy = 0.886719
   Epoch=65: 100%|██████████| 195/195 [06:15<00:00,  1.92s/it]Training loss = 88.595192, training accuracy = 0.980088
   Test accuracy = 0.882111
   Epoch=66: 100%|██████████| 195/195 [06:12<00:00,  1.93s/it]Training loss = 80.745857, training accuracy = 0.982272
   Test accuracy = 0.894331
   Epoch=67: 100%|██████████| 195/195 [06:12<00:00,  1.91s/it]Training loss = 79.769966, training accuracy = 0.982151
   Test accuracy = 0.893530
   Epoch=68: 100%|██████████| 195/195 [06:21<00:00,  1.97s/it]Training loss = 86.334030, training accuracy = 0.980369
   Test accuracy = 0.883413
   Epoch=69: 100%|██████████| 195/195 [06:14<00:00,  1.91s/it]Training loss = 82.313301, training accuracy = 0.982091
   Test accuracy = 0.889423
   Epoch=70: 100%|██████████| 195/195 [06:10<00:00,  1.89s/it]Training loss = 76.229935, training accuracy = 0.983373
   Test accuracy = 0.870292
   Epoch=71: 100%|██████████| 195/195 [06:12<00:00,  1.95s/it]Training loss = 71.863472, training accuracy = 0.983914
   Test accuracy = 0.893930
   Epoch=72: 100%|██████████| 195/195 [06:20<00:00,  1.94s/it]Training loss = 66.012581, training accuracy = 0.985156
   Test accuracy = 0.898337
   Epoch=73: 100%|██████████| 195/195 [06:15<00:00,  1.96s/it]Training loss = 61.428085, training accuracy = 0.986619
   Test accuracy = 0.893029
   Epoch=74: 100%|██████████| 195/195 [06:11<00:00,  1.90s/it]Training loss = 67.723068, training accuracy = 0.984976
   Test accuracy = 0.898538
   Epoch=75: 100%|██████████| 195/195 [06:13<00:00,  1.91s/it]Training loss = 65.637268, training accuracy = 0.985176
   Test accuracy = 0.900741
   Epoch=76: 100%|██████████| 195/195 [06:18<00:00,  1.97s/it]Training loss = 67.880424, training accuracy = 0.985036
   Test accuracy = 0.897536
   Epoch=77: 100%|██████████| 195/195 [06:16<00:00,  1.93s/it]Training loss = 61.967018, training accuracy = 0.986078
   Test accuracy = 0.897436
   Epoch=78: 100%|██████████| 195/195 [06:13<00:00,  1.93s/it]Training loss = 61.895309, training accuracy = 0.986058
   Test accuracy = 0.898938
   Epoch=79: 100%|██████████| 195/195 [06:13<00:00,  1.90s/it]Training loss = 61.111233, training accuracy = 0.985697
   Test accuracy = 0.898738
   Epoch=80: 100%|██████████| 195/195 [06:21<00:00,  1.97s/it]Training loss = 55.601448, training accuracy = 0.987099
   Test accuracy = 0.899639
   Epoch=81: 100%|██████████| 195/195 [06:13<00:00,  1.89s/it]Training loss = 57.219810, training accuracy = 0.987500
   Test accuracy = 0.887720
   Epoch=82: 100%|██████████| 195/195 [06:13<00:00,  1.92s/it]Training loss = 58.462112, training accuracy = 0.987240
   Test accuracy = 0.894832
   Epoch=83: 100%|██████████| 195/195 [06:11<00:00,  1.86s/it]Training loss = 55.885990, training accuracy = 0.987500
   Test accuracy = 0.904647
   Epoch=84: 100%|██████████| 195/195 [06:21<00:00,  2.00s/it]Training loss = 48.977169, training accuracy = 0.988982
   Test accuracy = 0.870192
   Epoch=85: 100%|██████████| 195/195 [06:15<00:00,  1.93s/it]Training loss = 47.429367, training accuracy = 0.989984
   Test accuracy = 0.880208
   Epoch=86: 100%|██████████| 195/195 [06:12<00:00,  1.88s/it]Training loss = 51.012726, training accuracy = 0.988401
   Test accuracy = 0.890124
   Epoch=87: 100%|██████████| 195/195 [06:14<00:00,  1.95s/it]Training loss = 49.567501, training accuracy = 0.988702
   Test accuracy = 0.901042
   Epoch=88: 100%|██████████| 195/195 [06:20<00:00,  1.96s/it]Training loss = 44.965919, training accuracy = 0.990124
   Test accuracy = 0.890925
   Epoch=89: 100%|██████████| 195/195 [06:17<00:00,  1.98s/it]Training loss = 52.335827, training accuracy = 0.988241
   Test accuracy = 0.898438
   Epoch=90: 100%|██████████| 195/195 [06:11<00:00,  1.90s/it]Training loss = 43.000404, training accuracy = 0.990204
   Test accuracy = 0.906050
   Epoch=91: 100%|██████████| 195/195 [06:12<00:00,  1.90s/it]Training loss = 44.402187, training accuracy = 0.990865
   Test accuracy = 0.881010
   Epoch=92: 100%|██████████| 195/195 [06:21<00:00,  1.93s/it]Training loss = 42.708675, training accuracy = 0.991026
   Test accuracy = 0.898738
   Epoch=93: 100%|██████████| 195/195 [06:14<00:00,  1.96s/it]Training loss = 40.271782, training accuracy = 0.991346
   Test accuracy = 0.880809
   Epoch=94: 100%|██████████| 195/195 [06:10<00:00,  1.88s/it]Training loss = 43.947540, training accuracy = 0.990224
   Test accuracy = 0.897636
   Epoch=95: 100%|██████████| 195/195 [06:12<00:00,  1.92s/it]Training loss = 39.025536, training accuracy = 0.991667
   Test accuracy = 0.902143
   Epoch=96: 100%|██████████| 195/195 [06:19<00:00,  1.98s/it]Training loss = 38.811058, training accuracy = 0.991526
   Test accuracy = 0.902945
   Epoch=97: 100%|██████████| 195/195 [06:15<00:00,  1.94s/it]Training loss = 44.107109, training accuracy = 0.990004
   Test accuracy = 0.896034
   Epoch=98: 100%|██████████| 195/195 [06:09<00:00,  1.91s/it]Training loss = 32.846859, training accuracy = 0.993109
   Test accuracy = 0.898137
   Epoch=99: 100%|██████████| 195/195 [06:13<00:00,  1.91s/it]Training loss = 35.559738, training accuracy = 0.992468
   Test accuracy = 0.899639
   ```
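   A possible way to add that synchronization is to average every BatchNorm layer's running statistics across the ranks just before switching to evaluation, reusing the `synchronize()` helper from the script below. This is only a hypothetical sketch: the layer traversal and the `running_mean` / `running_var` attribute names are assumptions for illustration, not the confirmed SINGA API:
   ```
   # hypothetical sketch, not the confirmed SINGA API: average the batch-norm
   # running statistics across ranks before the evaluation phase
   def sync_running_stats(model, dist_opt):
       # assumes the model exposes its layers and that each BatchNorm layer
       # stores its running statistics as tensors named running_mean/running_var
       for layer in model.layers:
           if hasattr(layer, 'running_mean') and hasattr(layer, 'running_var'):
               synchronize(layer.running_mean, dist_opt)
               synchronize(layer.running_var, dist_opt)

   # usage: call sync_running_stats(model, sgd) right before setting
   # autograd.training = False in the evaluation phase
   ```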
   The code used is as follows:
   ```
   #
   # Licensed to the Apache Software Foundation (ASF) under one
   # or more contributor license agreements.  See the NOTICE file
   # distributed with this work for additional information
   # regarding copyright ownership.  The ASF licenses this file
   # to you under the Apache License, Version 2.0 (the
   # "License"); you may not use this file except in compliance
   # with the License.  You may obtain a copy of the License at
   #
   #   http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing,
   # software distributed under the License is distributed on an
   # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   # KIND, either express or implied.  See the License for the
   # specific language governing permissions and limitations
   # under the License.
   #
   
   try:
       import pickle
   except ImportError:
       import cPickle as pickle
       
   from singa import singa_wrap as singa
   from singa import autograd
   from singa import tensor
   from singa import device
   from singa import opt
   import cv2
   #from scipy import misc
   import numpy as np
   from tqdm import trange
   
   def load_dataset(filepath):
       print('Loading data file %s' % filepath)
       with open(filepath, 'rb') as fd:
           try:
               cifar10 = pickle.load(fd, encoding='latin1')
           except TypeError:
               cifar10 = pickle.load(fd)
       image = cifar10['data'].astype(dtype=np.uint8)
       image = image.reshape((-1, 3, 32, 32))
       label = np.asarray(cifar10['labels'], dtype=np.uint8)
       label = label.reshape(label.size, 1)
       return image, label
   
   
   def load_train_data(dir_path='cifar-10-batches-py', num_batches=5):
       labels = []
       batchsize = 10000
       images = np.empty((num_batches * batchsize, 3, 32, 32), dtype=np.uint8)
       for did in range(1, num_batches + 1):
           fname_train_data = dir_path + "/data_batch_{}".format(did)
           image, label = load_dataset(fname_train_data)
           images[(did - 1) * batchsize:did * batchsize] = image
           labels.extend(label)
       images = np.array(images, dtype=np.float32)
       labels = np.array(labels, dtype=np.int32)
       return images, labels
   
   
   def load_test_data(dir_path='cifar-10-batches-py'):
       images, labels = load_dataset(dir_path + "/test_batch")
       return np.array(images, dtype=np.float32), np.array(labels, dtype=np.int32)
   
   def normalize_for_resnet(train_x, test_x):
       mean = [0.4914, 0.4822, 0.4465]
       std = [0.2023, 0.1994, 0.2010]
       train_x /= 255
       test_x /= 255
       for ch in range(3):  # normalize all three RGB channels
           train_x[:, ch, :, :] -= mean[ch]
           train_x[:, ch, :, :] /= std[ch]
           test_x[:, ch, :, :] -= mean[ch]
           test_x[:, ch, :, :] /= std[ch]
       return train_x, test_x
   
   def resize_dataset(x, IMG_SIZE):
       # upscale the 32 x 32 CIFAR-10 images to IMG_SIZE x IMG_SIZE for ResNet-50
       num_data = x.shape[0]
       dim = x.shape[1]
       X = np.zeros(shape=(num_data, dim, IMG_SIZE, IMG_SIZE), dtype=np.float32)
       for n in range(num_data):
           for d in range(dim):
               X[n, d, :, :] = cv2.resize(x[n, d, :, :], (IMG_SIZE, IMG_SIZE)).astype(np.float32)
       return X
   
   def augmentation(x, batch_size):
       # random crop (pad by 4 pixels, then crop back to 32 x 32) and random horizontal flip
       xpad = np.pad(x, [[0, 0], [0, 0], [4, 4], [4, 4]], 'symmetric')
       for data_num in range(batch_size):
           offset = np.random.randint(8, size=2)
           x[data_num, :, :, :] = xpad[data_num, :, offset[0]: offset[0] + 32, offset[1]: offset[1] + 32]
           if_flip = np.random.randint(2)
           if if_flip:
               x[data_num, :, :, :] = x[data_num, :, :, ::-1]
       return x
   
   def accuracy(pred, target):
       y = np.argmax(pred, axis=1)
       t = np.argmax(target, axis=1)
       a = y == t
       return np.array(a, "int").sum()
   
   def to_categorical(y, num_classes):
       y = np.array(y, dtype="int")
       n = y.shape[0]
       categorical = np.zeros((n, num_classes))
       for i in range(n):
           categorical[i, y[i]] = 1
       # cast once after the loop rather than on every iteration
       categorical = categorical.astype(np.float32)
       return categorical
   
   def data_partition(dataset_x, dataset_y, rank_in_global, world_size):
       data_per_rank = dataset_x.shape[0] // world_size
       idx_start = rank_in_global * data_per_rank
       idx_end = (rank_in_global + 1) * data_per_rank
       return dataset_x[idx_start: idx_end], dataset_y[idx_start: idx_end]
   
   def synchronize(tensor, dist_opt):
       # all-reduce the tensor across all ranks, then divide by the world size
       singa.synch(tensor.data, dist_opt.communicator)
       # cannot use tensor /= dist_opt.world_size because "/=" is not in place,
       # but "-=" is in place
       tensor -= (dist_opt.world_size - 1) * tensor / dist_opt.world_size
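       # algebra check: assuming singa.synch sums t over the w = world_size
       # ranks, t - (w - 1) * t / w == t / w, i.e. the cross-rank average,
       # computed entirely with in-place updates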
   
   if __name__ == '__main__':
   
   
       sgd = opt.SGD(lr=0.04, momentum=0.9, weight_decay=1e-5)
       sgd = opt.DistOpt(sgd)
   
       # load dataset
       # download it first with: python3 incubator-singa/examples/cifar10/download_data.py py
       train_x, train_y = load_train_data()
       test_x, test_y = load_test_data()
       train_x, test_x = normalize_for_resnet(train_x, test_x)
       train_x, train_y = data_partition(train_x, train_y, sgd.rank_in_global, sgd.world_size)
       test_x, test_y = data_partition(test_x, test_y, sgd.rank_in_global, sgd.world_size)
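       # each rank now holds a disjoint shard of the data, so the batch counts
       # computed below are per rank and the effective global batch size is
       # batch_size * world_size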
   
       num_classes = 10

       from resnet import resnet50
       model = resnet50(num_classes=num_classes)

       print('Start initialization............')
       dev = device.create_cuda_gpu_on(sgd.rank_in_local)
   
       max_epoch = 100
       batch_size = 32
       IMG_SIZE = 224
       tx = tensor.Tensor((batch_size, 3, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
       ty = tensor.Tensor((batch_size,), dev, tensor.int32)
       num_train_batch = train_x.shape[0] // batch_size
       num_test_batch = test_x.shape[0] // batch_size
       idx = np.arange(train_x.shape[0], dtype=np.int32)
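       # one-element scratch tensor reused to all-reduce the scalar metrics
       # (loss and correct counts) across ranks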
       reducer = tensor.Tensor((1,), dev, tensor.float32)
   
       # all-reduce the initial parameters so that every rank starts from
       # identical weights; a dummy forward/backward pass materializes the
       # parameters before they are synchronized
       autograd.training = True
       x = np.random.randn(batch_size, 3, IMG_SIZE, IMG_SIZE).astype(np.float32)
       y = np.random.randint(0, num_classes, batch_size, dtype=np.int32)
       tx.copy_from_numpy(x)
       ty.copy_from_numpy(y)
       out = model(tx)
       loss = autograd.softmax_cross_entropy(out, ty)               
       for p, g in autograd.backward(loss):
           synchronize(p, sgd)
   
       for epoch in range(max_epoch):
           np.random.shuffle(idx)
   
           # Training Phase
           autograd.training = True
           train_correct = np.zeros(shape=[1],dtype=np.float32)
           test_correct = np.zeros(shape=[1],dtype=np.float32)
           train_loss = np.zeros(shape=[1],dtype=np.float32)
           with trange(num_train_batch) as t:
               t.set_description('Epoch={}'.format(epoch))
               for b in t:
                   x = train_x[idx[b * batch_size: (b + 1) * batch_size]]
                   x = augmentation(x, batch_size)
                   x = resize_dataset(x,IMG_SIZE)
                   y = train_y[idx[b * batch_size: (b + 1) * batch_size]]
                   tx.copy_from_numpy(x)
                   ty.copy_from_numpy(y)
                   out = model(tx)
                   loss = autograd.softmax_cross_entropy(out, ty)
                   train_correct += accuracy(tensor.to_numpy(out), to_categorical(y, num_classes)).astype(np.float32)
                   train_loss += tensor.to_numpy(loss)[0]
                   for p, g in autograd.backward(loss):
                       sgd.update(p, g)
   
           # reduce the accuracy and loss from all the ranks
           reducer.copy_from_numpy(train_correct)
           reducer = sgd.all_reduce(reducer)
           train_correct = tensor.to_numpy(reducer)

           reducer.copy_from_numpy(train_loss)
           reducer = sgd.all_reduce(reducer)
           train_loss = tensor.to_numpy(reducer) * sgd.world_size
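           # assuming sgd.all_reduce averages over the ranks: the averaged
           # correct count divided by the per-rank sample count gives the global
           # accuracy, and scaling the averaged loss by world_size recovers the
           # loss summed over all ranks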
   
           if sgd.rank_in_global == 0:
               print('Training loss = %f, training accuracy = %f' % (train_loss, train_correct / (num_train_batch * batch_size)), flush=True)
   
   
           # Evaluation Phase
           autograd.training = False
           for b in range(num_test_batch):
               x = test_x[b * batch_size: (b + 1) * batch_size]
               x = resize_dataset(x,IMG_SIZE)
               y = test_y[b * batch_size: (b + 1) * batch_size]
               tx.copy_from_numpy(x)
               ty.copy_from_numpy(y)
               out_test = model(tx)
               test_correct += accuracy(tensor.to_numpy(out_test), to_categorical(y, num_classes))
   
           reducer.copy_from_numpy(test_correct)
           reducer = sgd.all_reduce(reducer)
           test_correct = tensor.to_numpy(reducer)
   
           if sgd.rank_in_global == 0:
               print('Test accuracy = %f' % (test_correct / (num_test_batch * batch_size)), flush=True)
   
   ```
   
   
   
   
