lanking520 opened a new issue #15067: CachedOp performance regression
URL: https://github.com/apache/incubator-mxnet/issues/15067

Recently I have been benchmarking CachedOp performance and observed some regressions. Please see the table below:

| | Module API | CachedOp with static | CachedOp without static |
|------------|------------|----------------------|-------------------------|
| p2.8xlarge | 43ms | 42ms | 51ms |
| p3.2xlarge | 11ms | 19ms | 16ms |
| c5.4xlarge | 36ms | 38ms | 42ms |

I would like to highlight the GPU performance comparison: on p2 there is a performance gain with the flags set, but a regression on p3.

```
imported_net.hybridize(static_alloc = True, static_shape = True)
```

In theory, setting these two flags should yield a performance boost, since memory is reused across calls. However, on the larger GPU it does not seem to help.

## Benchmark Script

```python
import mxnet as mx
from mxnet import ndarray as nd
import numpy as np
import json, time, os
from mxnet import gluon

path = 'http://data.mxnet.io/models/imagenet/'
[mx.test_utils.download(path + 'resnet/152-layers/resnet-152-0000.params'),
 mx.test_utils.download(path + 'resnet/152-layers/resnet-152-symbol.json'),
 mx.test_utils.download(path + 'synset.txt')]


def compute_stats(perf_results, results):
    results['average'] = np.average(perf_results)
    results['tp50'] = np.percentile(perf_results, 50)
    results['tp90'] = np.percentile(perf_results, 90)
    results['tp99'] = np.percentile(perf_results, 99)


ctx_str = os.environ['BENCHMARK_CTX']
if ctx_str == 'GPU':
    ctx = mx.gpu(0)
elif ctx_str == 'CPU':
    ctx = mx.cpu()

benchmark = {}
prefix = 'resnet-152'

# Model load time
t1 = time.time()
imported_net = gluon.nn.SymbolBlock.imports(prefix + '-symbol.json',
                                            ['data', 'softmax_label'],
                                            prefix + '-0000.params')
t2 = time.time()
elapsed = (t2 - t1) * 1000
imported_net.hybridize(static_alloc=True, static_shape=True)
benchmark['ModelLoadTime'] = elapsed

fname = mx.test_utils.download(
    'https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true')
img = mx.image.imread(fname)
# convert into format (batch, RGB, width, height)
img = mx.image.imresize(img, 300, 300)  # resize
img = img.transpose((2, 0, 1))          # channel first
img = img.expand_dims(axis=0)           # batchify
img = img.astype('float32')
sf_label = nd.ones((1,))
if ctx_str == 'GPU':
    img = img.as_in_context(mx.gpu(0))
    sf_label = sf_label.as_in_context(mx.gpu(0))  # keep both inputs on the same context

# First inference (includes graph construction / memory allocation)
t1 = time.time()
op = imported_net(img, sf_label)
op.wait_to_read()
t2 = time.time()
elapsed = (t2 - t1) * 1000
benchmark['FirstInferCall'] = elapsed

times = 100
time_cost = []
for idx in range(times):
    t1 = time.time()
    op = imported_net(img, sf_label)
    op.wait_to_read()
    t2 = time.time()
    elapsed = (t2 - t1) * 1000
    time_cost.append(elapsed)
    print("time cost: ", elapsed, "ms")

# Note: this overwrites the load time recorded above with the first-call overhead.
benchmark['ModelLoadTime'] = benchmark['FirstInferCall'] - time_cost[0]
compute_stats(time_cost, benchmark)

output = json.dumps(benchmark)
with open('Inf.json', 'w') as f:
    f.write(output)
```
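For reference, the measurement pattern the script uses (time the first cold call separately, then collect steady-state latencies and report average/tp50/tp90/tp99) can be sanity-checked without MXNet at all. Below is a minimal standard-library sketch of that pattern; `fake_infer` is a hypothetical stand-in for the `imported_net(img, sf_label)` call, and `percentile` approximates NumPy's default linear-interpolation percentile.

```python
import time


def fake_infer():
    """Hypothetical stand-in for the model call; sleeps ~1 ms."""
    time.sleep(0.001)


def percentile(vals, p):
    """Linear-interpolation percentile (approximates np.percentile's default)."""
    s = sorted(vals)
    k = (len(s) - 1) * p / 100.0
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)


def benchmark_fn(fn, iters=100):
    """Time fn: record the first (cold) call separately, then steady-state calls."""
    stats = {}
    t1 = time.time()
    fn()
    stats['FirstCall'] = (time.time() - t1) * 1000  # ms

    costs = []
    for _ in range(iters):
        t1 = time.time()
        fn()
        costs.append((time.time() - t1) * 1000)

    stats['average'] = sum(costs) / len(costs)
    for p in (50, 90, 99):
        stats['tp%d' % p] = percentile(costs, p)
    return stats


stats = benchmark_fn(fake_infer, iters=20)
print(stats)
```

Separating the first call from the loop matters here because with `static_alloc`/`static_shape` the first invocation carries the one-time allocation cost, which would otherwise skew the percentiles.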
