chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821
##########
File path: src/api/config.i
##########
@@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1

Review comment:
In addition to the above, I also ran a multi-GPU training and evaluation test on 8 K80 GPUs, training ResNet-50 on the CIFAR-10 dataset. The training loss fell from 3983.8 to 345.7 in about 30 epochs, and the evaluation accuracy reached 86.8%.
However, this test does not synchronize the batch-normalization running mean and variance across GPUs before the evaluation phase:

```
Epoch=0: 100%|██████████| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260
Test accuracy = 0.347556
Epoch=1: 100%|██████████| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768
Test accuracy = 0.437700
Epoch=2: 100%|██████████| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558
Test accuracy = 0.459936
Epoch=3: 100%|██████████| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348
Test accuracy = 0.548978
Epoch=4: 100%|██████████| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847
Test accuracy = 0.594451
Epoch=5: 100%|██████████| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911
Test accuracy = 0.633413
Epoch=6: 100%|██████████| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753
Test accuracy = 0.659054
Epoch=7: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220
Test accuracy = 0.709836
Epoch=8: 100%|██████████| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523
Test accuracy = 0.735477
Epoch=9: 100%|██████████| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054
Test accuracy = 0.745393
Epoch=10: 100%|██████████| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353
Test accuracy = 0.750501
Epoch=11: 100%|██████████| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261
Test accuracy = 0.777744
Epoch=12: 100%|██████████| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167
Test accuracy = 0.793369
Epoch=13: 100%|██████████| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309
Test accuracy = 0.807692
Epoch=14: 100%|██████████| 195/195 [06:11<00:00, 1.90s/it]Training loss = 790.558838, training accuracy = 0.823918
Test accuracy = 0.795373
Epoch=15: 100%|██████████| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734
Test accuracy = 0.816707
Epoch=16: 100%|██████████| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855
Test accuracy = 0.818510
Epoch=17: 100%|██████████| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986
Test accuracy = 0.826122
Epoch=18: 100%|██████████| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216
Test accuracy = 0.844752
Epoch=19: 100%|██████████| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551
Test accuracy = 0.845653
Epoch=20: 100%|██████████| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060
Test accuracy = 0.835938
Epoch=21: 100%|██████████| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370
Test accuracy = 0.849860
Epoch=22: 100%|██████████| 195/195 [06:12<00:00, 1.91s/it]Training loss = 508.004578, training accuracy = 0.885056
Test accuracy = 0.833434
Epoch=23: 100%|██████████| 195/195 [06:12<00:00, 1.92s/it]Training loss = 477.516602, training accuracy = 0.892488
Test accuracy = 0.858474
Epoch=24: 100%|██████████| 195/195 [06:20<00:00, 1.96s/it]Training loss = 455.839996, training accuracy = 0.896595
Test accuracy = 0.867388
Epoch=25: 100%|██████████| 195/195 [06:16<00:00, 1.95s/it]Training loss = 434.568390, training accuracy = 0.904327
Test accuracy = 0.858774
Epoch=26: 100%|██████████| 195/195 [06:10<00:00, 1.87s/it]Training loss = 414.232391, training accuracy = 0.907071
Test accuracy = 0.833333
Epoch=27: 100%|██████████| 195/195 [06:13<00:00, 1.87s/it]Training loss = 400.625458, training accuracy = 0.909275
Test accuracy = 0.858974
Epoch=28: 100%|██████████| 195/195 [06:20<00:00, 1.95s/it]Training loss = 378.750885, training accuracy = 0.914443
Test accuracy = 0.865885
Epoch=29: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 369.449249, training accuracy = 0.917548
Test accuracy = 0.871394
Epoch=30: 100%|██████████| 195/195 [06:13<00:00, 1.93s/it]Training loss = 345.693939, training accuracy = 0.921935
Test accuracy = 0.868389
```

The code used is as follows:

```
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

try:
    import pickle
except ImportError:
    import cPickle as pickle

from singa import singa_wrap as singa
from singa import autograd
from singa import tensor
from singa import device
from singa import opt
import cv2
#from scipy import misc
import numpy as np
from tqdm import trange


def load_dataset(filepath):
    print('Loading data file %s' % filepath)
    with open(filepath, 'rb') as fd:
        try:
            # Python 3 needs an explicit encoding for Python 2 pickles
            cifar10 = pickle.load(fd, encoding='latin1')
        except TypeError:
            cifar10 = pickle.load(fd)
    image = cifar10['data'].astype(dtype=np.uint8)
    image = image.reshape((-1, 3, 32, 32))
    label = np.asarray(cifar10['labels'], dtype=np.uint8)
    label = label.reshape(label.size, 1)
    return image, label


def load_train_data(dir_path='cifar-10-batches-py', num_batches=5):
    labels = []
    batchsize = 10000
    images = np.empty((num_batches * batchsize, 3, 32, 32), dtype=np.uint8)
    for did in range(1, num_batches + 1):
        fname_train_data = dir_path + "/data_batch_{}".format(did)
        image, label = load_dataset(fname_train_data)
        images[(did - 1) * batchsize:did * batchsize] = image
        labels.extend(label)
    images = np.array(images, dtype=np.float32)
    labels = np.array(labels, dtype=np.int32)
    return images, labels


def load_test_data(dir_path='cifar-10-batches-py'):
    images, labels = load_dataset(dir_path + "/test_batch")
    return np.array(images, dtype=np.float32), np.array(labels, dtype=np.int32)


def normalize_for_resnet(train_x, test_x):
    mean = [0.4914, 0.4822, 0.4465]
    std = [0.2023, 0.1994, 0.2010]
    train_x /= 255
    test_x /= 255
    # Normalize all three channels. The run reported above used range(0, 2),
    # which left the third channel unnormalized.
    for ch in range(3):
        train_x[:, ch, :, :] -= mean[ch]
        train_x[:, ch, :, :] /= std[ch]
        test_x[:, ch, :, :] -= mean[ch]
        test_x[:, ch, :, :] /= std[ch]
    return train_x, test_x


def resize_dataset(x, IMG_SIZE):
    # Resize every channel of every image to IMG_SIZE x IMG_SIZE
    num_data = x.shape[0]
    dim = x.shape[1]
    X = np.zeros(shape=(num_data, dim, IMG_SIZE, IMG_SIZE), dtype=np.float32)
    for n in range(0, num_data):
        for d in range(0, dim):
            X[n, d, :, :] = cv2.resize(x[n, d, :, :], (IMG_SIZE, IMG_SIZE)).astype(np.float32)
    return X


def augmentation(x, batch_size):
    # Random 32x32 crop from a 4-pixel symmetric padding, plus random horizontal flip
    xpad = np.pad(x, [[0, 0], [0, 0], [4, 4], [4, 4]], 'symmetric')
    for data_num in range(0, batch_size):
        offset = np.random.randint(8, size=2)
        x[data_num, :, :, :] = xpad[data_num, :, offset[0]: offset[0] + 32, offset[1]: offset[1] + 32]
        if_flip = np.random.randint(2)
        if (if_flip):
            x[data_num, :, :, :] = x[data_num, :, :, ::-1]
    return x


def accuracy(pred, target):
    # Number of correct predictions in the batch
    y = np.argmax(pred, axis=1)
    t = np.argmax(target, axis=1)
    a = y == t
    return np.array(a, "int").sum()


def to_categorical(y, num_classes):
    # One-hot encode the integer labels
    y = np.array(y, dtype="int")
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    for i in range(0, n):
        categorical[i, y[i]] = 1
    categorical = categorical.astype(np.float32)
    return categorical


def data_partition(dataset_x, dataset_y, rank_in_global, world_size):
    # Give each rank a contiguous, equally sized shard of the dataset
    data_per_rank = dataset_x.shape[0] // world_size
    idx_start = rank_in_global * data_per_rank
    idx_end = (rank_in_global + 1) * data_per_rank
    return dataset_x[idx_start: idx_end], dataset_y[idx_start: idx_end]


def synchronize(tensor, dist_opt):
    singa.synch(tensor.data, dist_opt.communicator)
    # cannot use tensor /= dist_opt.world_size because "/=" is not in place, but "-=" is in place
    tensor -= (dist_opt.world_size - 1) * tensor / dist_opt.world_size


if __name__ == '__main__':
    sgd = opt.SGD(lr=0.04, momentum=0.9, weight_decay=1e-5)
    sgd = opt.DistOpt(sgd)

    # load dataset
    # need to download with "/python3 incubator-singa/examples/cifar10/download_data.py py"
    train_x, train_y = load_train_data()
    test_x, test_y = load_test_data()
    train_x, test_x = normalize_for_resnet(train_x, test_x)
    train_x, train_y = data_partition(train_x, train_y, sgd.rank_in_global, sgd.world_size)
    test_x, test_y = data_partition(test_x, test_y, sgd.rank_in_global, sgd.world_size)

    num_classes = 10

    from resnet import resnet50
    model = resnet50(num_classes=num_classes)

    print('Start initialization............')
    dev = device.create_cuda_gpu_on(sgd.rank_in_local)

    max_epoch = 100
    batch_size = 32
    IMG_SIZE = 224

    tx = tensor.Tensor((batch_size, 3, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
    ty = tensor.Tensor((batch_size,), dev, tensor.int32)
    num_train_batch = train_x.shape[0] // batch_size
    num_test_batch = test_x.shape[0] // batch_size
    idx = np.arange(train_x.shape[0], dtype=np.int32)
    reducer = tensor.Tensor((1,), dev, tensor.float32)

    # all-reduce the initial parameters so that every rank starts from the same values
    autograd.training = True
    #x = np.zeros(shape=[batch_size, 3, IMG_SIZE, IMG_SIZE], dtype=np.float32)
    #y = np.zeros(shape=[batch_size], dtype=np.int32)
    x = np.random.randn(batch_size, 3, IMG_SIZE, IMG_SIZE).astype(np.float32)
    y = np.random.randint(0, num_classes, batch_size, dtype=np.int32)
    tx.copy_from_numpy(x)
    ty.copy_from_numpy(y)
    out = model(tx)
    loss = autograd.softmax_cross_entropy(out, ty)
    for p, g in autograd.backward(loss):
        synchronize(p, sgd)

    for epoch in range(max_epoch):
        np.random.shuffle(idx)

        # Training Phase
        autograd.training = True
        train_correct = np.zeros(shape=[1], dtype=np.float32)
        test_correct = np.zeros(shape=[1], dtype=np.float32)
        train_loss = np.zeros(shape=[1], dtype=np.float32)

        with trange(num_train_batch) as t:
            t.set_description('Epoch={}'.format(epoch))
            for b in t:
                x = train_x[idx[b * batch_size: (b + 1) * batch_size]]
                x = augmentation(x, batch_size)
                x = resize_dataset(x, IMG_SIZE)
                y = train_y[idx[b * batch_size: (b + 1) * batch_size]]
                tx.copy_from_numpy(x)
                ty.copy_from_numpy(y)
                out = model(tx)
                loss = autograd.softmax_cross_entropy(out, ty)
                train_correct += accuracy(tensor.to_numpy(out), to_categorical(y, num_classes)).astype(np.float32)
                train_loss += tensor.to_numpy(loss)[0]
                for p, g in autograd.backward(loss):
                    sgd.update(p, g)

        #print("rank"+str(sgd.rank_in_global)+": Acc="+str(train_correct)+". Loss="+str(train_loss), flush=True)
        #print("world size="+str(sgd.world_size), flush=True)

        # reduce the accuracy and loss from all ranks
        reducer.copy_from_numpy(train_correct)
        reducer = sgd.all_reduce(reducer)
        train_correct = tensor.to_numpy(reducer)
        reducer.copy_from_numpy(train_loss)
        reducer = sgd.all_reduce(reducer)
        train_loss = tensor.to_numpy(reducer) * sgd.world_size
        #if(sgd.rank_in_global==0):
        #    print('Training loss = %f, Acc count = %f' % (train_loss, train_correct), flush=True)
        if (sgd.rank_in_global == 0):
            print('Training loss = %f, training accuracy = %f' % (train_loss, train_correct / (num_train_batch * batch_size)), flush=True)

        # Evaluation Phase
        autograd.training = False
        for b in range(num_test_batch):
            x = test_x[b * batch_size: (b + 1) * batch_size]
            x = resize_dataset(x, IMG_SIZE)
            y = test_y[b * batch_size: (b + 1) * batch_size]
            tx.copy_from_numpy(x)
            ty.copy_from_numpy(y)
            out_test = model(tx)
            test_correct += accuracy(tensor.to_numpy(out_test), to_categorical(y, num_classes))
        reducer.copy_from_numpy(test_correct)
        reducer = sgd.all_reduce(reducer)
        test_correct = tensor.to_numpy(reducer)
        if (sgd.rank_in_global == 0):
            print('Test accuracy = %f' % (test_correct / (num_test_batch * (batch_size))), flush=True)
```
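The run above skips synchronizing the running mean and variance before evaluation, so each GPU evaluates with its own batch-normalization statistics. As a rough sketch of what that missing step could look like, the snippet below averages one running statistic across ranks with the same "sum, then subtract in place" trick the script uses for the initial parameters. Everything here is an assumption for illustration: the all-reduce is simulated in NumPy rather than done through SINGA's communicator, and the names `simulated_all_reduce_sum`, `average_in_place`, and `sync_running_stats` are hypothetical and not part of the SINGA API.

```python
import numpy as np


def simulated_all_reduce_sum(per_rank_arrays):
    """Stand-in for a sum all-reduce: every rank receives the elementwise sum.

    Hypothetical helper; a real run would use the framework's communicator.
    """
    total = np.sum(per_rank_arrays, axis=0)
    return [total.copy() for _ in per_rank_arrays]


def average_in_place(summed, world_size):
    """Turn an all-reduced sum into a mean using only an in-place "-="."""
    # summed - (w - 1) * summed / w == summed / w
    summed -= (world_size - 1) * summed / world_size
    return summed


def sync_running_stats(per_rank_stats):
    """Average one running statistic (mean or variance) over all simulated ranks."""
    world_size = len(per_rank_stats)
    reduced = simulated_all_reduce_sum(per_rank_stats)
    return [average_in_place(r, world_size) for r in reduced]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    world_size, channels = 8, 4
    # One running-mean vector per simulated GPU
    running_mean = [rng.normal(size=channels) for _ in range(world_size)]
    synced = sync_running_stats(running_mean)
    expected = np.mean(running_mean, axis=0)
    # After synchronization every rank holds the same averaged statistic
    assert all(np.allclose(s, expected) for s in synced)
    print("all ranks agree on the averaged running mean")
```

Plain averaging of the per-rank variances is only one simple choice: it ignores the spread between the per-rank means, so it is an approximation of the variance over the pooled data rather than the exact value.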