rongzha1 commented on issue #12866: Optimization for embedding OP for CPU
URL: https://github.com/apache/incubator-mxnet/pull/12866#issuecomment-440888276
 
 
   > Any test to back up the speed up claim
   
   Before this optimization, wide_deep model inference ran at 3068232.151 
samples/sec. Profile:
   
   Time of each OP:
   --
   CopyCPU2CPU               176.736 ms    1.841                 ms/call    96 calls   28.38 %
   _contrib_SparseEmbedding  134.894 ms    0.16213221153846155   ms/call   832 calls   21.66 %
   slice                     119.339 ms    1.2431145833333332    ms/call    96 calls   19.17 %
   Concat                    74.123  ms    2.31634375            ms/call    32 calls   11.90 %
   FullyConnected            56.788  ms    0.5915416666666666    ms/call    96 calls    9.12 %
   SliceChannel              41.911  ms    1.30971875            ms/call    32 calls    6.73 %
   dot                       8.206   ms    0.2564375             ms/call    32 calls    1.32 %
   SoftmaxOutput             3.458   ms    0.1080625             ms/call    32 calls    0.56 %
   Activation                3.14    ms    0.0490625             ms/call    64 calls    0.50 %
   elemwise_add              1.893   ms    0.05915625            ms/call    32 calls    0.30 %
   DeleteVariable            1.135   ms    0.017461538461538462  ms/call    65 calls    0.18 %
   broadcast_add             0.641   ms    0.02003125            ms/call    32 calls    0.10 %
   WaitForVar                0.388   ms    0.012125              ms/call    32 calls    0.06 %
    
   Total OP Time: 622.65200000 ms
   
   After the optimization, wide_deep model inference ran at 3804926.458 
samples/sec. Profile:
   
   Time of each OP:
   --
   CopyCPU2CPU               173.408 ms    1.8063333333333331    ms/call    96 calls   34.01 %
   slice                     115.525 ms    1.2033854166666667    ms/call    96 calls   22.66 %
   Concat                    69.754  ms    2.1798125             ms/call    32 calls   13.68 %
   FullyConnected            53.635  ms    0.5586979166666667    ms/call    96 calls   10.52 %
   SliceChannel              40.2    ms    1.25625               ms/call    32 calls    7.88 %
   _contrib_SparseEmbedding  39.528  ms    0.047509615384615386  ms/call   832 calls    7.75 %
   dot                       8.279   ms    0.25871875            ms/call    32 calls    1.62 %
   SoftmaxOutput             3.341   ms    0.10440625            ms/call    32 calls    0.66 %
   Activation                2.858   ms    0.04465625            ms/call    64 calls    0.56 %
   elemwise_add              1.673   ms    0.05228125            ms/call    32 calls    0.33 %
   DeleteVariable            0.962   ms    0.03103225806451613   ms/call    31 calls    0.19 %
   broadcast_add             0.478   ms    0.0149375             ms/call    32 calls    0.09 %
   WaitForVar                0.233   ms    0.00728125            ms/call    32 calls    0.05 %
    
   Total OP Time: 509.87400000 ms
   
   
   The Embedding OP (_contrib_SparseEmbedding) time dropped from 134.894 ms 
to 39.528 ms.
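
For readers who want the ratios rather than the raw numbers, here is a small sketch that derives them from the figures reported in the two profiles. The dictionary keys and variable names are made up for illustration; only the numeric values come from the measurements above.

```python
# Figures copied from the two profiles in this comment.
# The key names are illustrative only, not part of any MXNet API.
before = {
    "samples_per_sec": 3068232.151,
    "embedding_ms": 134.894,   # _contrib_SparseEmbedding total time
    "total_op_ms": 622.652,    # "Total OP Time"
}
after = {
    "samples_per_sec": 3804926.458,
    "embedding_ms": 39.528,
    "total_op_ms": 509.874,
}

# Time ratios: old / new, so a value above 1 means the new build is faster.
embedding_speedup = before["embedding_ms"] / after["embedding_ms"]
total_op_speedup = before["total_op_ms"] / after["total_op_ms"]

# Throughput ratio: new / old, since higher samples/sec is better.
throughput_gain = after["samples_per_sec"] / before["samples_per_sec"]

# Absolute time saved, in milliseconds.
embedding_saved_ms = before["embedding_ms"] - after["embedding_ms"]
total_saved_ms = before["total_op_ms"] - after["total_op_ms"]

print(f"Embedding OP speedup:  {embedding_speedup:.2f}x")
print(f"Total OP time speedup: {total_op_speedup:.2f}x")
print(f"Throughput gain:       {throughput_gain:.2f}x")
print(f"Embedding time saved:  {embedding_saved_ms:.3f} ms")
print(f"Total OP time saved:   {total_saved_ms:.3f} ms")
```

On these numbers the Embedding OP is roughly 3.4x faster and end-to-end throughput is up by about 24%. Most of the reduction in total OP time comes from the Embedding OP itself; the remainder is small changes in the other operators, which may simply be run-to-run variation.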
   
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
