chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311056639
 
 

 ##########
 File path: src/api/config.i
 ##########
 @@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1
 
 Review comment:
   Updated on 6th August: I removed a bug in the commit 0616000 which concerns 
to the number of parameters in the all-reduce. Then I did a 8 * K80 multi-GPUs 
training and evaluation test with a simple MNIST dataset on simple CNN. It 
reduces the training loss from 802.7 to 42.2 in about 30 Epochs: 
   ```
   Epoch=0: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 92.86it/s]Training loss = 
802.659485, training accuracy = 0.713825
   Test accuracy = 0.920025
   Epoch=1: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 93.42it/s]Training loss = 
246.589371, training accuracy = 0.916767
   Test accuracy = 0.956106
   Epoch=2: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 94.04it/s]Training loss = 
175.012894, training accuracy = 0.941106
   Test accuracy = 0.967208
   Epoch=3: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 95.66it/s] Training loss = 
144.684052, training accuracy = 0.951539
   Test accuracy = 0.970806
   Epoch=4: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.59it/s]Training loss = 
120.399704, training accuracy = 0.959402
   Test accuracy = 0.976049
   Epoch=5: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.79it/s]Training loss = 
107.832191, training accuracy = 0.963709
   Test accuracy = 0.975946
   Epoch=6: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.70it/s]Training loss = 
96.289490, training accuracy = 0.967014
   Test accuracy = 0.979441
   Epoch=7: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.34it/s]Training loss = 
88.031815, training accuracy = 0.970436
   Test accuracy = 0.980983
   Epoch=8: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.81it/s]Training loss = 
79.349884, training accuracy = 0.973090
   Test accuracy = 0.980058
   Epoch=9: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.82it/s]Training loss = 
77.825607, training accuracy = 0.974342
   Test accuracy = 0.977282
   Epoch=10: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.97it/s]Training loss = 
74.710297, training accuracy = 0.974576
   Test accuracy = 0.983861
   Epoch=11: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.98it/s]Training loss = 
69.400230, training accuracy = 0.976162
   Test accuracy = 0.982936
   Epoch=12: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.03it/s]Training loss = 
65.100449, training accuracy = 0.978148
   Test accuracy = 0.983553
   Epoch=13: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.17it/s]Training loss = 
65.113991, training accuracy = 0.978249
   Test accuracy = 0.986534
   Epoch=14: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.83it/s]Training loss = 
63.065636, training accuracy = 0.978566
   Test accuracy = 0.984683
   Epoch=15: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.11it/s]Training loss = 
58.334709, training accuracy = 0.980018
   Test accuracy = 0.983758
   Epoch=16: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.16it/s]Training loss = 
58.280094, training accuracy = 0.980285
   Test accuracy = 0.983655
   Epoch=17: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 
53.226196, training accuracy = 0.981420
   Test accuracy = 0.985197
   Epoch=18: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 
55.968140, training accuracy = 0.980786
   Test accuracy = 0.982422
   Epoch=19: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.14it/s]Training loss = 
52.761921, training accuracy = 0.982489
   Test accuracy = 0.985814
   Epoch=20: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.86it/s]Training loss = 
51.989666, training accuracy = 0.982973
   Test accuracy = 0.983758
   Epoch=21: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.91it/s]Training loss = 
52.571381, training accuracy = 0.982455
   Test accuracy = 0.987973
   Epoch=22: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.99it/s]Training loss = 
49.347313, training accuracy = 0.983140
   Test accuracy = 0.986637
   Epoch=23: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.93it/s]Training loss = 
49.053402, training accuracy = 0.983674
   Test accuracy = 0.985814
   Epoch=24: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 99.28it/s] Training loss = 
46.263908, training accuracy = 0.984442
   Test accuracy = 0.986431
   Epoch=25: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 104.22it/s]Training loss = 
46.021286, training accuracy = 0.984275
   Test accuracy = 0.987664
   Epoch=26: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 103.67it/s]Training loss = 
45.950298, training accuracy = 0.984091
   Test accuracy = 0.986534
   Epoch=27: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.87it/s]Training loss = 
43.926952, training accuracy = 0.984675
   Test accuracy = 0.987150
   Epoch=28: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.89it/s]Training loss = 
44.020412, training accuracy = 0.985110
   Test accuracy = 0.983450
   Epoch=29: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 103.06it/s]Training loss = 
41.906254, training accuracy = 0.985744
   Test accuracy = 0.984375
   Epoch=30: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.93it/s]Training loss = 
42.237778, training accuracy = 0.985527
   Test accuracy = 0.987664
   ```
   
   The following is the code used:
   ```
   #
   # Licensed to the Apache Software Foundation (ASF) under one
   # or more contributor license agreements.  See the NOTICE file
   # distributed with this work for additional information
   # regarding copyright ownership.  The ASF licenses this file
   # to you under the Apache License, Version 2.0 (the
   # "License"); you may not use this file except in compliance
   # with the License.  You may obtain a copy of the License at
   #
   #   http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing,
   # software distributed under the License is distributed on an
   # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   # KIND, either express or implied.  See the License for the
   # specific language governing permissions and limitations
   # under the License.
   #
   
   # the code is modified from
   # https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
   
   try:
       import pickle
   except ImportError:
       import cPickle as pickle
   
   from singa import singa_wrap as singa
   from singa import autograd
   from singa import tensor
   from singa import device
   from singa import opt
   import numpy as np
   from tqdm import trange
   
   import os
   import urllib.request
   import gzip
   import codecs
   
   def load_dataset():
       train_x_url = 
'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'
       train_y_url = 
'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
       valid_x_url = 
'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz'
       valid_y_url = 
'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
       train_x = read_image_file(check_exist_or_download(train_x_url)).astype(
           np.float32)
       train_y = read_label_file(check_exist_or_download(train_y_url)).astype(
           np.float32)
       valid_x = read_image_file(check_exist_or_download(valid_x_url)).astype(
           np.float32)
       valid_y = read_label_file(check_exist_or_download(valid_y_url)).astype(
           np.float32)
       return train_x, train_y, valid_x, valid_y
   
   
   def check_exist_or_download(url):
   
       download_dir = '/tmp/'
   
       name = url.rsplit('/', 1)[-1]
       filename = os.path.join(download_dir, name)
       if not os.path.isfile(filename):
           print("Downloading %s" % url)
           urllib.request.urlretrieve(url, filename)
       return filename
   
   
   def read_label_file(path):
       with gzip.open(path, 'rb') as f:
           data = f.read()
           assert get_int(data[:4]) == 2049
           length = get_int(data[4:8])
           parsed = np.frombuffer(data, dtype=np.uint8, offset=8).reshape(
               (length))
           return parsed
   
   
   def get_int(b):
       return int(codecs.encode(b, 'hex'), 16)
   
   
   def read_image_file(path):
       with gzip.open(path, 'rb') as f:
           data = f.read()
           assert get_int(data[:4]) == 2051
           length = get_int(data[4:8])
           num_rows = get_int(data[8:12])
           num_cols = get_int(data[12:16])
           parsed = np.frombuffer(data, dtype=np.uint8, offset=16).reshape(
               (length, 1, num_rows, num_cols))
           return parsed
   
   
   
   def normalize_for_resnet(train_x, test_x):   
       mean=[0.4914, 0.4822, 0.4465]
       std=[0.2023, 0.1994, 0.2010] 
       train_x /= 255
       test_x /= 255
       for ch in range(0,2):
           train_x[:, ch, :, :] -= mean[ch]
           train_x[:, ch, :, :] /= std[ch]
           test_x[:, ch, :, :] -= mean[ch]
           test_x[:, ch, :, :] /= std[ch]
       return train_x, test_x
   
   
   def augmentation(x, batch_size):
       xpad = np.pad(x, [[0, 0], [0, 0], [4, 4], [4, 4]], 'symmetric')
       for data_num in range(0, batch_size):
           offset = np.random.randint(8, size=2)
           x[data_num,:,:,:] = xpad[data_num, :, offset[0]: offset[0] + 28, 
offset[1]: offset[1] + 28]
           if_flip = np.random.randint(2)
           if (if_flip):
               x[data_num, :, :, :] = x[data_num, :, :, ::-1]
       return x
   
   def accuracy(pred, target):
       y = np.argmax(pred, axis=1)
       t = np.argmax(target, axis=1)
       a = y == t
       return np.array(a, "int").sum()
   
   def to_categorical(y, num_classes):
       """
       Converts a class vector (integers) to binary class matrix.
   
       Args
           y: class vector to be converted into a matrix
               (integers from 0 to num_classes).
           num_classes: total number of classes.
   
       Return
           A binary matrix representation of the input.
       """
       y = np.array(y, dtype="int")
       n = y.shape[0]
       categorical = np.zeros((n, num_classes))
       categorical[np.arange(n), y] = 1
       categorical = categorical.astype(np.float32)
       return categorical
   
   def accuracy(pred, target):
       y = np.argmax(pred, axis=1)
       t = np.argmax(target, axis=1)
       a = y == t
       return np.array(a, "int").sum()
   
   class CNN:
       def __init__(self):
           self.conv1 = autograd.Conv2d(1, 20, 5, padding=0)
           self.conv2 = autograd.Conv2d(20, 50, 5, padding=0)
           self.linear1 = autograd.Linear(4 * 4 * 50, 500)
           self.linear2 = autograd.Linear(500, 10)
           self.pooling1 = autograd.MaxPool2d(2, 2, padding=0)
           self.pooling2 = autograd.MaxPool2d(2, 2, padding=0)
   
       def forward(self, x):
           y = self.conv1(x)
           y = autograd.relu(y)
           y = self.pooling1(y)
           y = self.conv2(y)
           y = autograd.relu(y)
           y = self.pooling2(y)
           y = autograd.flatten(y)
           y = self.linear1(y)
           y = autograd.relu(y)
           y = self.linear2(y)
           return y
   
   def data_partition(dataset_x, dataset_y, rank_in_global, world_size):
       data_per_rank = dataset_x.shape[0] // world_size
       idx_start = rank_in_global * data_per_rank
       idx_end = (rank_in_global + 1) * data_per_rank
       return dataset_x[idx_start: idx_end], dataset_y[idx_start: idx_end]
   
   def sychronize(tensor, dist_opt):
       singa.synch(tensor.data, dist_opt.communicator)
       # cannot use tensor/=dist_opt.world_size because "/=" not in place, but 
"-=" is in place
       tensor -= (dist_opt.world_size - 1) * tensor / dist_opt.world_size    
   
   
   
   if __name__ == '__main__':
   
   
       sgd = opt.SGD(lr=0.04, momentum=0.9, weight_decay=1e-5)
       sgd = opt.DistOpt(sgd)
   
       # load data
       train_x, train_y, test_x, test_y = load_dataset()
       # normalization
       train_x = train_x / 255
       test_x = test_x / 255
       num_classes=10
   
       train_y = to_categorical(train_y, num_classes)
       test_y = to_categorical(test_y, num_classes)
   
   
       train_x, train_y = data_partition(train_x, train_y, sgd.rank_in_global, 
sgd.world_size)
       test_x, test_y = data_partition(test_x, test_y, sgd.rank_in_global, 
sgd.world_size)
   
       #print(train_y[0])
   
       print(np.shape(train_x))
       print(np.shape(train_y))
   
       # create model
       model = CNN()
   
       print('Start intialization............')
       dev = device.create_cuda_gpu_on(sgd.rank_in_local)
   
       max_epoch = 100
       batch_size = 64
       IMG_SIZE = 28
       tx = tensor.Tensor((batch_size, 1, IMG_SIZE, IMG_SIZE), dev, 
tensor.float32)
       ty = tensor.Tensor((batch_size, num_classes), dev, tensor.int32)
       num_train_batch = train_x.shape[0] // batch_size
       num_test_batch = test_x.shape[0] // batch_size
       idx = np.arange(train_x.shape[0], dtype=np.int32)
       reducer = tensor.Tensor((1,), dev, tensor.float32)
   
       #allreduce the initialize parameter
       autograd.training = True
       #x = np.zeros(shape=[batch_size, 1, IMG_SIZE, IMG_SIZE], 
dtype=np.float32)
       #y = np.zeros(shape=[batch_size], dtype=np.int32)
       x = np.random.randn(batch_size, 1, IMG_SIZE, IMG_SIZE).astype(np.float32)
       y = np.zeros( shape=(batch_size, num_classes), dtype=np.int32)
       tx.copy_from_numpy(x)
       ty.copy_from_numpy(y)
       out = model.forward(tx)
       loss = autograd.softmax_cross_entropy(out, ty)               
       for p, g in autograd.backward(loss):
           #p=sgd.all_reduce(p)
           sychronize(p, sgd)
   
       for epoch in range(max_epoch):
           np.random.shuffle(idx)
   
           #Training Phase
           autograd.training = True
           train_correct = np.zeros(shape=[1],dtype=np.float32)
           test_correct = np.zeros(shape=[1],dtype=np.float32)
           train_loss = np.zeros(shape=[1],dtype=np.float32)
           with trange(num_train_batch) as t:
               t.set_description('Epoch={}'.format(epoch))
               for b in t:
                   x = train_x[idx[b * batch_size: (b + 1) * batch_size]]
                   x = augmentation(x, batch_size)
                   y = train_y[idx[b * batch_size: (b + 1) * batch_size]]
                   tx.copy_from_numpy(x)
                   ty.copy_from_numpy(y)
                   out = model.forward(tx)
                   loss = autograd.softmax_cross_entropy(out, ty)               
                   train_correct += accuracy(tensor.to_numpy(out), y)
                   train_loss += tensor.to_numpy(loss)[0]
                   for p, g in autograd.backward(loss):
                       sgd.update(p, g)
   
           #reduce the accuracy from multiple device
           reducer.copy_from_numpy(train_correct)
           reducer=sgd.all_reduce(reducer)
           train_correct = tensor.to_numpy(reducer) 
   
           #reduce the loss from multiple device
           reducer.copy_from_numpy(train_loss)
           reducer=sgd.all_reduce(reducer)
           train_loss = tensor.to_numpy(reducer) * sgd.world_size
   
           if(sgd.rank_in_global==0):
               print('Training loss = %f, training accuracy = %f' % 
(train_loss, train_correct / (num_train_batch*batch_size)), flush=True)
   
           #Evaulation Phase
           autograd.training = False
           for b in range(num_test_batch):
               x = test_x[b * batch_size: (b + 1) * batch_size]
               y = test_y[b * batch_size: (b + 1) * batch_size]
               tx.copy_from_numpy(x)
               ty.copy_from_numpy(y)
               out_test = model.forward(tx)
               test_correct += accuracy(tensor.to_numpy(out_test), y)
   
           reducer.copy_from_numpy(test_correct)
           reducer=sgd.all_reduce(reducer)
           test_correct = tensor.to_numpy(reducer) 
   
           if(sgd.rank_in_global==0):
               print('Test accuracy = %f' % (test_correct / 
(num_test_batch*(batch_size))), flush=True)
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to