bgawrych opened a new pull request #20621:
URL: https://github.com/apache/incubator-mxnet/pull/20621


   ## Description ##
   Improves performance of the stack operation. Performance results show a 
significant speedup on axis=0 (up to 7x faster).
   
   Performance results were collected on CLX8280 with 
`KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 numactl --physcpubind=0-27 --membind=0`:

   shape | axis | master time (s per 100 calls) | onednn time (s per 100 calls)
   -- | -- | -- | --
   (128, 128) | 0 | 0.007561 | 0.008217
   (128, 128) | 1 | 0.004158 | 0.00457
   (128, 512) | 0 | 0.014108 | 0.007263
   (128, 512) | 1 | 0.004416 | 0.005567
   (128, 1024) | 0 | 0.024753 | 0.009431
   (128, 1024) | 1 | 0.0046 | 0.004892
   (128, 4096) | 0 | 0.088938 | 0.025933
   (128, 4096) | 1 | 0.006305 | 0.006167
   (512, 128) | 0 | 0.012593 | 0.006721
   (512, 128) | 1 | 0.004545 | 0.00462
   (512, 512) | 0 | 0.043897 | 0.01301
   (512, 512) | 1 | 0.005042 | 0.005218
   (512, 1024) | 0 | 0.079853 | 0.016997
   (512, 1024) | 1 | 0.006117 | 0.006382
   (512, 4096) | 0 | 0.517834 | 0.097284
   (512, 4096) | 1 | 0.070154 | 0.038691
   (1024, 128) | 0 | 0.022151 | 0.008327
   (1024, 128) | 1 | 0.004755 | 0.004991
   (1024, 512) | 0 | 0.080592 | 0.017348
   (1024, 512) | 1 | 0.006391 | 0.006452
   (1024, 1024) | 0 | 0.205667 | 0.040287
   (1024, 1024) | 1 | 0.013286 | 0.013144
   (1024, 4096) | 0 | 1.159914 | 0.267409
   (1024, 4096) | 1 | 0.174798 | 0.153152
   (4096, 128) | 0 | 0.081543 | 0.017331
   (4096, 128) | 1 | 0.006936 | 0.006952
   (4096, 512) | 0 | 0.575121 | 0.079814
   (4096, 512) | 1 | 0.084379 | 0.040853
   (4096, 1024) | 0 | 1.244555 | 0.251577
   (4096, 1024) | 1 | 0.1782 | 0.154799
   (4096, 4096) | 0 | 5.169306 | 1.180926
   (4096, 4096) | 1 | 0.766602 | 0.740192
   (32, 128, 128) | 0 | 0.080957 | 0.017508
   (32, 128, 128) | 1 | 0.00692 | 0.006721
   (32, 128, 128) | 2 | 0.006921 | 0.006859
   (32, 128, 512) | 0 | 0.555404 | 0.081633
   (32, 128, 512) | 1 | 0.077143 | 0.037545
   (32, 128, 512) | 2 | 0.083525 | 0.041425
   (32, 128, 1024) | 0 | 1.225558 | 0.255515
   (32, 128, 1024) | 1 | 0.190202 | 0.154146
   (32, 128, 1024) | 2 | 0.177495 | 0.1549
   (32, 128, 4096) | 0 | 5.006225 | 1.090737
   (32, 128, 4096) | 1 | 0.831286 | 0.759118
   (32, 128, 4096) | 2 | 0.765793 | 0.742179
   (32, 512, 128) | 0 | 0.560635 | 0.090112
   (32, 512, 128) | 1 | 0.076585 | 0.042584
   (32, 512, 128) | 2 | 0.095465 | 0.04338
   (32, 512, 512) | 0 | 2.536246 | 0.541157
   (32, 512, 512) | 1 | 0.397728 | 0.341854
   (32, 512, 512) | 2 | 0.407399 | 0.35051
   (32, 512, 1024) | 0 | 5.034069 | 1.092211
   (32, 512, 1024) | 1 | 0.830025 | 0.760654
   (32, 512, 1024) | 2 | 0.772979 | 0.740602
   (32, 512, 4096) | 0 | 20.72267 | 4.655413
   (32, 512, 4096) | 1 | 3.503717 | 3.075174
   (32, 512, 4096) | 2 | 3.02452 | 3.002688
   (32, 1024, 128) | 0 | 1.196986 | 0.24314
   (32, 1024, 128) | 1 | 0.190801 | 0.15396
   (32, 1024, 128) | 2 | 0.220551 | 0.154176
   (32, 1024, 512) | 0 | 5.024947 | 1.09671
   (32, 1024, 512) | 1 | 0.828758 | 0.768377
   (32, 1024, 512) | 2 | 0.842505 | 0.748963
   (32, 1024, 1024) | 0 | 10.38974 | 2.242758
   (32, 1024, 1024) | 1 | 1.869875 | 1.547855
   (32, 1024, 1024) | 2 | 1.538614 | 1.496964
   (32, 1024, 4096) | 0 | 41.49604 | 9.043207
   (32, 1024, 4096) | 1 | 7.476183 | 6.120244
   (32, 1024, 4096) | 2 | 6.035282 | 6.005883
   (32, 4096, 128) | 0 | 5.003981 | 1.093519
   (32, 4096, 128) | 1 | 0.826404 | 0.769504
   (32, 4096, 128) | 2 | 0.926172 | 0.751195
   (32, 4096, 512) | 0 | 20.09485 | 4.335339
   (32, 4096, 512) | 1 | 3.502722 | 3.074266
   (32, 4096, 512) | 2 | 3.359437 | 3.006657
   (32, 4096, 1024) | 0 | 40.23293 | 8.423752
   (32, 4096, 1024) | 1 | 7.462769 | 6.156365
   (32, 4096, 1024) | 2 | 6.141823 | 6.012172
   (32, 4096, 4096) | 0 | 154.2752 | 35.87757
   (32, 4096, 4096) | 1 | 28.58106 | 24.12571
   (32, 4096, 4096) | 2 | 24.46488 | 23.94641
   
   
   ```
   import time

   import mxnet.gluon.nn as nn
   import mxnet.numpy as np


   class TestStack(nn.HybridBlock):
       """Wraps np.stack in a HybridBlock so it can be hybridized."""

       def __init__(self, axis=None):
           super(TestStack, self).__init__()
           self._axis = axis

       def forward(self, a, *args):
           return np.stack([a] + list(args), axis=self._axis)


   dims = [128, 512, 1024, 4096]
   print("shape;axis;time")
   for ndim in range(2):
       for dim1 in dims:
           for dim2 in dims:
               # ndim == 0: 2-D inputs; ndim == 1: 3-D inputs with leading dim 32
               shape = (dim1, dim2) if ndim == 0 else (32, dim1, dim2)
               a = np.random.uniform(-1.0, 1.0, shape).astype(np.float32)
               b = np.random.uniform(-1.0, 1.0, shape).astype(np.float32)
               c = np.random.uniform(-1.0, 1.0, shape).astype(np.float32)
               d = np.random.uniform(-1.0, 1.0, shape).astype(np.float32)
               for axis in range(2 + ndim):
                   # The block is built and hybridized here, but the timed
                   # loop below calls np.stack directly.
                   stack = TestStack(axis)
                   stack.hybridize()
                   tic = time.time()
                   for i in range(100):
                       out = np.stack([a, b, c, d], axis=axis)
                       out.wait_to_read()  # wait for the async op to finish
                   toc = time.time()
                   # Total wall-clock seconds for the 100 calls
                   print(f"{shape};{axis};{toc-tic}")
   ```
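
To compare two runs of the script above, the `shape;axis;time` output can be parsed and turned into per-case speedup ratios. The following is a minimal sketch and not part of the PR: the helper names `parse_results` and `speedups` are hypothetical, and the sample logs are three rows copied from the table above, rewritten in the script's output format.

```python
def parse_results(text):
    """Parse 'shape;axis;time' lines into {(shape, axis): seconds}."""
    results = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("shape;"):
            continue  # skip blank lines and the header row
        shape, axis, seconds = line.split(";")
        results[(shape, int(axis))] = float(seconds)
    return results


def speedups(baseline, candidate):
    """Return {(shape, axis): baseline_time / candidate_time} for shared keys."""
    return {key: baseline[key] / candidate[key]
            for key in baseline if key in candidate}


# Sample rows taken from the results table (hypothetical log variable names).
master_log = """shape;axis;time
(4096, 512);0;0.575121
(4096, 512);1;0.084379
(128, 128);0;0.007561
"""
onednn_log = """shape;axis;time
(4096, 512);0;0.079814
(4096, 512);1;0.040853
(128, 128);0;0.008217
"""

ratios = speedups(parse_results(master_log), parse_results(onednn_log))
# Print the largest speedups first.
for (shape, axis), ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{shape} axis={axis}: {ratio:.2f}x")
```

A ratio above 1 means the onednn run was faster than master. A ratio below 1, as for `(128, 128)` on axis 0, means it was slower for that case.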
   
   

