ElaineBao commented on issue #17906: Optimize AddTakeGrad Tensor Sum
URL: https://github.com/apache/incubator-mxnet/pull/17906#issuecomment-609050759

Sorry for the late reply. I tried to use opperf, but it doesn't work; an error was thrown when I ran it:

```
File "/incubator-mxnet/benchmark/opperf/rules/default_params.py", line 606, in <module>
    "axis_shape": DEFAULT_AXIS_SHAPE,
NameError: name 'DEFAULT_AXIS_SHAPE' is not defined
```

So I used the MXNet profiler to validate the performance instead, which I think is also reasonable. The script is as follows:

```python
import random
import pandas as pd
import mxnet as mx
import numpy as np
from sklearn.model_selection import train_test_split

batch_size = 1000
num_epoch = 5
model_prefix = 'drivethru_attention_d'
n_plus = 522
total = 40000
profiling = True

# build a synthetic dataset: 5 random PLU ids per record, binary label
records = []
for i in range(0, total):
    pluids = [random.randint(0, n_plus - 1) for i in range(0, 5)]
    label = random.randint(0, 1)
    records.append((pluids, label))
data = pd.DataFrame(records, columns=['pluids', 'label'])

train, test = train_test_split(data, test_size=0.1, random_state=100)
X_train = mx.io.NDArrayIter(data={'pluids': np.array(train['pluids'].values.tolist(), dtype=int)},
                            label={'output_label': train['label'].values},
                            batch_size=batch_size, shuffle=True)
X_eval = mx.io.NDArrayIter(data={'pluids': np.array(test['pluids'].values.tolist(), dtype=int)},
                           label={'output_label': test['label'].values},
                           batch_size=batch_size, shuffle=True)

y_true = mx.symbol.Variable('output_label')
pluids = mx.symbol.Variable('pluids')
plu_embed = mx.symbol.Embedding(data=pluids, input_dim=n_plus, output_dim=50, name='plu_embed')
fc1 = mx.symbol.FullyConnected(data=plu_embed, num_hidden=int(n_plus), name='fc1')
rec_model = mx.symbol.SoftmaxOutput(data=fc1, label=y_true, name='output')
mod = mx.mod.Module(symbol=rec_model,
                    data_names=['pluids'],
                    label_names=['output_label'],
                    context=[mx.cpu()])

# enable profiler
mx.profiler.set_config(profile_symbolic=True, profile_imperative=True,
                       profile_memory=False, profile_api=True,
                       filename='profile.json', aggregate_stats=True)
mx.profiler.set_state('run')

mod.fit(train_data=X_train,
        num_epoch=num_epoch,
        initializer=mx.init.Xavier(rnd_type="gaussian"),
        optimizer='adagrad',
        eval_metric=['accuracy'],
        validation_metric=['accuracy', mx.metric.TopKAccuracy(3)],
        eval_data=X_eval,
        batch_end_callback=mx.callback.Speedometer(batch_size, 2))

mx.profiler.set_state('stop')
print(mx.profiler.dumps())
```

And the performance:

1. Before the optimization of `_backward_Embedding`:

```
operator
=================
Name                           Total Count    Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                           -----------    ---------    -------------    -------------    -------------
_backward_Embedding                    180    2854.3340          12.3320          29.1350          15.8574
_mul_scalar                           1620     527.4130           0.0030           3.3110           0.3256
_backward_FullyConnected               180     162.2140           0.7430           1.6510           0.9012
SoftmaxOutput                          200     129.6620           0.1250           1.2650           0.6483
FullyConnected                         200     110.0570           0.2340          42.0660           0.5503
argmax                                 200      49.5320           0.1840           0.4930           0.2477
broadcast_add                         1080      31.0420           0.0040           2.9860           0.0287
Embedding                              200      25.0530           0.0240           3.8110           0.1253
_backward_SoftmaxOutput                180      19.0860           0.0560           0.8680           0.1060
square                                 540      18.5240           0.0030           2.8510           0.0343
sqrt                                   540      17.3870           0.0060           0.9440           0.0322
DeleteVariable                        3532      11.3070           0.0020           0.0330           0.0032
broadcast_sub                          540       8.2790           0.0040           0.0440           0.0153
broadcast_div                          540       7.4970           0.0050           0.0730           0.0139
_plus_scalar                           555       6.5850           0.0040           0.0650           0.0119
SetValueOp                               8       5.8160           0.0050           5.6540           0.7270
CopyCPU2CPU                            448       4.1040           0.0020           0.1030           0.0092
ResourceParallelRandomSetSeed            1       3.8440           3.8440           3.8440           3.8440
WaitForVar                             220       1.3150           0.0040           0.0120           0.0060
Cast                                    33       1.1590           0.0080           0.2680           0.0351
_random_normal                           2       0.7270           0.1210           0.6060           0.3635
_zeros                                   6       0.2650           0.0070           0.0720           0.0442
_div_scalar                             15       0.1400           0.0050           0.0190           0.0093
SetupExec                                6       0.0150           0.0010           0.0060           0.0025
_full                                    1       0.0060           0.0060           0.0060           0.0060
```

2. After the optimization:

```
operator
=================
Name                           Total Count    Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                           -----------    ---------    -------------    -------------    -------------
_mul_scalar                           1620     451.0970           0.0030           3.2960           0.2785
_backward_FullyConnected               180     195.9230           0.7440           2.3720           1.0885
SoftmaxOutput                          200     156.3020           0.1080           1.2910           0.7815
FullyConnected                         200     136.2320           0.2300          43.5920           0.6812
argmax                                 200      54.5550           0.1710           0.4960           0.2728
_backward_SoftmaxOutput                180      39.8900           0.0570           0.8930           0.2216
Embedding                              200      27.0910           0.0270           3.1330           0.1355
broadcast_add                         1080      24.6370           0.0040           0.6560           0.0228
_backward_Embedding                    180      21.5230           0.0970           0.4120           0.1196
sqrt                                   540      20.1840           0.0060           0.1300           0.0374
square                                 540      19.2420           0.0040           2.9200           0.0356
DeleteVariable                        3532      13.1160           0.0010           0.1310           0.0037
broadcast_sub                          540      11.0550           0.0040           0.0980           0.0205
broadcast_div                          540       9.3750           0.0050           0.1110           0.0174
_plus_scalar                           555       8.2280           0.0040           0.1140           0.0148
SetValueOp                               8       5.9090           0.0050           5.7620           0.7386
CopyCPU2CPU                            448       4.2760           0.0030           0.1040           0.0095
ResourceParallelRandomSetSeed            1       3.8370           3.8370           3.8370           3.8370
Cast                                    33       1.2800           0.0090           0.2670           0.0388
WaitForVar                             195       1.2160           0.0040           0.0180           0.0062
_random_normal                           2       0.7190           0.1200           0.5990           0.3595
_zeros                                   6       0.2710           0.0060           0.0790           0.0452
_div_scalar                             15       0.2610           0.0050           0.0790           0.0174
SetupExec                                6       0.0150           0.0020           0.0050           0.0025
_full                                    1       0.0070           0.0070           0.0070           0.0070
```
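Reading the two runs together: the total time for `_backward_Embedding` drops from 2854.3340 ms to 21.5230 ms over the same 180 calls, while the other operators stay in the same ballpark. A quick check of the speedup from the table numbers:

```python
# Totals for _backward_Embedding, taken from the two profiler tables (ms)
before_ms = 2854.3340
after_ms = 21.5230

speedup = before_ms / after_ms
print(f"speedup: {speedup:.1f}x")  # roughly 132.6x for this operator
```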
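For context on what this benchmark exercises: the backward pass of `Embedding` is essentially a scatter-add (the `AddTakeGrad` path this PR optimizes), where output-gradient rows belonging to repeated indices must be accumulated into the weight gradient. A minimal NumPy sketch of that accumulation; shapes and values here are illustrative, not taken from the PR:

```python
import numpy as np

# Illustrative sizes (hypothetical): a 6-row embedding table with width 4
input_dim, output_dim = 6, 4
indices = np.array([1, 3, 1, 5])                 # rows looked up in the forward pass
grad_out = np.ones((indices.size, output_dim))   # upstream gradient, one row per lookup

grad_weight = np.zeros((input_dim, output_dim))
# Scatter-add: rows looked up more than once (index 1 here) accumulate
# their gradients; np.add.at handles the repeated indices correctly.
np.add.at(grad_weight, indices, grad_out)
```

The interesting performance question is exactly this accumulation over duplicate indices, which is why `_backward_Embedding` dominates the first profile.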