ElaineBao commented on issue #17906: Optimize AddTakeGrad Tensor Sum
URL: https://github.com/apache/incubator-mxnet/pull/17906#issuecomment-609050759
 
 
   Sorry for the late reply.
   I tried to use opperf, but it didn't work; it threw the following error when I ran it:
   ```
   #   File "/incubator-mxnet/benchmark/opperf/rules/default_params.py", line 606, in <module>
   #     "axis_shape": DEFAULT_AXIS_SHAPE,
   # NameError: name 'DEFAULT_AXIS_SHAPE' is not defined
   ```
   
   So I used the MXNet profiler to validate the performance instead, which I think is also reasonable.
   The script is as follows:
   ```
   import random
   import pandas as pd
   import mxnet as mx
   import numpy as np
   from sklearn.model_selection import train_test_split
   
   batch_size = 1000
   num_epoch = 5
   model_prefix = 'drivethru_attention_d'
   n_plus = 522
   total = 40000
   profiling = True
   
   records = []
   for _ in range(total):
       pluids = [random.randint(0, n_plus - 1) for _ in range(5)]
       label = random.randint(0, 1)
       records.append((pluids, label))
   
   data = pd.DataFrame(records,
                       columns=['pluids','label'])
   train, test = train_test_split(data, test_size=0.1, random_state=100)
   
   X_train = mx.io.NDArrayIter(data={'pluids': np.array(train['pluids'].values.tolist(), dtype=int)},
                               label={'output_label': train['label'].values},
                               batch_size=batch_size,
                               shuffle=True)
   X_eval = mx.io.NDArrayIter(data={'pluids': np.array(test['pluids'].values.tolist(), dtype=int)},
                              label={'output_label': test['label'].values},
                              batch_size=batch_size,
                              shuffle=True)
   y_true = mx.symbol.Variable('output_label')
   
   
   pluids = mx.symbol.Variable('pluids')
   plu_embed = mx.symbol.Embedding(data=pluids, input_dim=n_plus, output_dim=50, name='plu_embed')
   
   fc1 = mx.symbol.FullyConnected(data=plu_embed, num_hidden=int(n_plus), name='fc1')
   rec_model = mx.symbol.SoftmaxOutput(data=fc1, label=y_true, name='output')
   
   mod = mx.mod.Module(symbol=rec_model,
                       data_names=['pluids'],
                       label_names=['output_label'],
                       context=[mx.cpu()])
   # enable profiler
   mx.profiler.set_config(profile_symbolic=True, profile_imperative=True, profile_memory=False,
                          profile_api=True, filename='profile.json', aggregate_stats=True)
   mx.profiler.set_state('run')
   
   mod.fit(train_data=X_train,
           num_epoch=num_epoch,
           initializer=mx.init.Xavier(rnd_type="gaussian"),
           optimizer='adagrad',
           eval_metric=['accuracy'],
           validation_metric=['accuracy', mx.metric.TopKAccuracy(3)],
           eval_data=X_eval,
           batch_end_callback=mx.callback.Speedometer(batch_size, 2))
   
   mx.profiler.set_state('stop')
   print(mx.profiler.dumps())
   ```
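
For context, the operator this script exercises, `_backward_Embedding`, has to sum the incoming gradients of every position that looked up the same embedding row. The NumPy sketch below illustrates only those semantics; it is not MXNet's kernel and says nothing about how the PR optimizes it. The function name `embedding_backward` is hypothetical, and the shapes are borrowed from the script above.

```python
import numpy as np

def embedding_backward(indices, grad_out, input_dim):
    """Scatter-add output gradients into the rows of the weight gradient."""
    output_dim = grad_out.shape[-1]
    grad_weight = np.zeros((input_dim, output_dim), dtype=grad_out.dtype)
    # np.add.at accumulates every contribution when a row index repeats;
    # a plain grad_weight[indices] += grad_out would keep only one update per row.
    np.add.at(grad_weight, indices, grad_out)
    return grad_weight

# Shapes borrowed from the script: 1000 samples, 5 ids each, 522 rows, 50 columns.
rng = np.random.default_rng(0)
indices = rng.integers(0, 522, size=(1000, 5))
grad_out = rng.standard_normal((1000, 5, 50)).astype(np.float32)
grad_weight = embedding_backward(indices, grad_out, input_dim=522)
print(grad_weight.shape)
```

With 5000 lookups spread over only 522 rows, most rows receive several contributions per batch, so the summation step is likely where the backward pass spends its time in this benchmark.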
   
   The profiler results:
   1. Before the optimization of _backward_Embedding:
   ```
   operator
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   _backward_Embedding                   180        2854.3340          12.3320          29.1350          15.8574
   _mul_scalar                          1620         527.4130           0.0030           3.3110           0.3256
   _backward_FullyConnected              180         162.2140           0.7430           1.6510           0.9012
   SoftmaxOutput                         200         129.6620           0.1250           1.2650           0.6483
   FullyConnected                        200         110.0570           0.2340          42.0660           0.5503
   argmax                                200          49.5320           0.1840           0.4930           0.2477
   broadcast_add                        1080          31.0420           0.0040           2.9860           0.0287
   Embedding                             200          25.0530           0.0240           3.8110           0.1253
   _backward_SoftmaxOutput               180          19.0860           0.0560           0.8680           0.1060
   square                                540          18.5240           0.0030           2.8510           0.0343
   sqrt                                  540          17.3870           0.0060           0.9440           0.0322
   DeleteVariable                       3532          11.3070           0.0020           0.0330           0.0032
   broadcast_sub                         540           8.2790           0.0040           0.0440           0.0153
   broadcast_div                         540           7.4970           0.0050           0.0730           0.0139
   _plus_scalar                          555           6.5850           0.0040           0.0650           0.0119
   SetValueOp                              8           5.8160           0.0050           5.6540           0.7270
   CopyCPU2CPU                           448           4.1040           0.0020           0.1030           0.0092
   ResourceParallelRandomSetSeed           1           3.8440           3.8440           3.8440           3.8440
   WaitForVar                            220           1.3150           0.0040           0.0120           0.0060
   Cast                                   33           1.1590           0.0080           0.2680           0.0351
   _random_normal                          2           0.7270           0.1210           0.6060           0.3635
   _zeros                                  6           0.2650           0.0070           0.0720           0.0442
   _div_scalar                            15           0.1400           0.0050           0.0190           0.0093
   SetupExec                               6           0.0150           0.0010           0.0060           0.0025
   _full                                   1           0.0060           0.0060           0.0060           0.0060
   ```
   
   2. After the optimization of _backward_Embedding:
   ```
   operator
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   _mul_scalar                          1620         451.0970           0.0030           3.2960           0.2785
   _backward_FullyConnected              180         195.9230           0.7440           2.3720           1.0885
   SoftmaxOutput                         200         156.3020           0.1080           1.2910           0.7815
   FullyConnected                        200         136.2320           0.2300          43.5920           0.6812
   argmax                                200          54.5550           0.1710           0.4960           0.2728
   _backward_SoftmaxOutput               180          39.8900           0.0570           0.8930           0.2216
   Embedding                             200          27.0910           0.0270           3.1330           0.1355
   broadcast_add                        1080          24.6370           0.0040           0.6560           0.0228
   _backward_Embedding                   180          21.5230           0.0970           0.4120           0.1196
   sqrt                                  540          20.1840           0.0060           0.1300           0.0374
   square                                540          19.2420           0.0040           2.9200           0.0356
   DeleteVariable                       3532          13.1160           0.0010           0.1310           0.0037
   broadcast_sub                         540          11.0550           0.0040           0.0980           0.0205
   broadcast_div                         540           9.3750           0.0050           0.1110           0.0174
   _plus_scalar                          555           8.2280           0.0040           0.1140           0.0148
   SetValueOp                              8           5.9090           0.0050           5.7620           0.7386
   CopyCPU2CPU                           448           4.2760           0.0030           0.1040           0.0095
   ResourceParallelRandomSetSeed           1           3.8370           3.8370           3.8370           3.8370
   Cast                                   33           1.2800           0.0090           0.2670           0.0388
   WaitForVar                            195           1.2160           0.0040           0.0180           0.0062
   _random_normal                          2           0.7190           0.1200           0.5990           0.3595
   _zeros                                  6           0.2710           0.0060           0.0790           0.0452
   _div_scalar                            15           0.2610           0.0050           0.0790           0.0174
   SetupExec                               6           0.0150           0.0020           0.0050           0.0025
   _full                                   1           0.0070           0.0070           0.0070           0.0070
   ```
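
To make the comparison between the two tables explicit, the sketch below divides the "before" total time by the "after" total time for a few operators. The numbers are copied from the Time (ms) column of the tables above; the dictionary names and the `speedup` helper are hypothetical and exist only for this illustration.

```python
# Total time in ms, copied from the Time (ms) column of the two tables.
before_ms = {
    "_backward_Embedding": 2854.3340,
    "_mul_scalar": 527.4130,
    "_backward_FullyConnected": 162.2140,
}
after_ms = {
    "_backward_Embedding": 21.5230,
    "_mul_scalar": 451.0970,
    "_backward_FullyConnected": 195.9230,
}

def speedup(before, after):
    """Return before/after time ratios for operators present in both runs."""
    return {name: before[name] / after[name] for name in before if name in after}

for name, ratio in speedup(before_ms, after_ms).items():
    print(f"{name:<26} {ratio:8.2f}x")
```

On these figures `_backward_Embedding` drops from about 2854 ms to about 21.5 ms over 180 calls, a ratio between 132 and 133, while the other two operators stay within roughly 20% of their earlier totals.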

