vvchernov opened a new pull request #8599:
URL: https://github.com/apache/tvm/pull/8599


   The LSTM cell was unified and moved to a common place shared by all frontends; it is now used by both the ONNX and PyTorch frontends of TVM. The LSTM cell was analyzed and modified to remove excess memory traffic and other manipulations that the compiler potentially cannot optimize away on its side. Performance tests for different LSTM variants, before and after the change, were carried out. The results are collected in the tables below:
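For context, the computation a single LSTM cell performs is the standard gated step below; this is a minimal numpy sketch of that math for illustration only, not the patch's actual Relay code (all names here are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, w_ih, w_hh, b_ih, b_hh):
    # All four gates come from one fused matmul, then are split into
    # input (i), forget (f), cell candidate (g) and output (o) parts.
    gates = x @ w_ih.T + h @ w_hh.T + b_ih + b_hh   # (batch, 4 * hidden)
    i, f, g, o = np.split(gates, 4, axis=-1)
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next

# Shapes matching Table 1 below: feature size 5, hidden size 10, batch size 1.
rng = np.random.default_rng(0)
feat, hid, batch = 5, 10, 1
x = rng.standard_normal((batch, feat))
h0 = np.zeros((batch, hid))
c0 = np.zeros((batch, hid))
w_ih = rng.standard_normal((4 * hid, feat))
w_hh = rng.standard_normal((4 * hid, hid))
b_ih = np.zeros(4 * hid)
b_hh = np.zeros(4 * hid)
h1, c1 = lstm_cell(x, h0, c0, w_ih, w_hh, b_ih, b_hh)
```

Fusing the four gate matmuls into one is the kind of restructuring that reduces the memory manipulations mentioned above.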
   
   Table 1. Average time per run (microsec) for 10000 runs. The following 
parameters are used (small input size): with biases = True, batch first = True, 
feature size = 5, hidden size = 10, number of stacked layers = 2, sequence 
length = 3, batch size = 1, trials number = 100
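Per-run averages like those in the table could be gathered with a harness along these lines (a hypothetical helper shown for clarity, not the actual benchmarking script used for this PR):

```python
import time

def avg_time_us(fn, runs=10000, warmup=100):
    """Average wall-clock time per call of fn, in microseconds."""
    for _ in range(warmup):      # discard warm-up iterations
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1e6

# Example with a trivial stand-in workload.
t = avg_time_us(lambda: sum(range(100)), runs=1000, warmup=10)
```

In practice each cell of the table would time the compiled TVM module (or the reference runtime) with the stated input shapes.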
   
   Frontend name / LSTM type | uni | b | s | sb
   --- | --- | --- | --- | ---
   Onnx | 26.8 | 55.3 | 50.7 | 112.7
   Onnx dev | 20.1 | 40.5 | 37.7 | 81.7
   Onnx tuned | 5.1 | 5.8 | 7.1 | 11.1
   Onnx dev tuned | 4.7 | 6.0 | 6.2 | 10.2
   Pytorch | 12.1 | 19.9 | 20.5 | 37.2
   Pytorch dev | 8.9 | 14.1 | 14.9 | 27.5
   Pytorch tuned | 4.8 | 6.0 | 6.4 | 9.9
   Pytorch dev tuned | 4.7 | 6.1 | 6.4 | 9.8
   Onnxruntime | 16.0 | 21.1 | 24.8 | 36.7
   
   There are several LSTM types: uni – unidirectional, b – bidirectional, s – stacked (2 layers are used in the tests), sb – stacked bidirectional. The suffix "dev" marks the implementation from this patch. Without tuning there is a large performance gap between the ONNX and PyTorch implementations (the ONNX one is slower); with tuning the ONNX implementation was still slightly worse than the PyTorch one. This patch closes the performance gap for tuned LSTM and improves the untuned results for both the ONNX and PyTorch frontends.
   
   Table 2. Average time per run (ms) for 1000 runs. The following parameters 
are used (big input size): with biases = True, batch first = True, feature size 
= 40, hidden size = 256, number of stacked layers = 3, sequence length = 160, 
batch size = 1, trials number = 100
   
   Frontend name / LSTM type | uni | b | s | sb
   --- | --- | --- | --- | ---
   Onnx | 47.3 | | 205 |
   Onnx dev | | | |
   Onnx tuned | 8.74 | | 31.8 |
   Onnx dev tuned | | | |
   Pytorch | 7.77 | | 27.2 |
   Pytorch dev | | | |
   Pytorch tuned | 7.71 | | 27.3 |
   Pytorch dev tuned | 7.61 | | |
   Onnxruntime | 1.50 | | 4.69 |
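For reference, the stacked ("s") configuration measured above simply chains layers, feeding each layer's hidden-state sequence into the next. A small self-contained numpy sketch under toy shapes, with names that are illustrative rather than the patch's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(xs, w_ih, w_hh, b):
    """Run one LSTM layer over a sequence; xs has shape (seq, batch, feat_in)."""
    hid = w_hh.shape[1]
    h = np.zeros((xs.shape[1], hid))
    c = np.zeros((xs.shape[1], hid))
    outputs = []
    for x in xs:                              # one time step per iteration
        gates = x @ w_ih.T + h @ w_hh.T + b
        i, f, g, o = np.split(gates, 4, axis=-1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)                  # (seq, batch, hid)

def stacked_lstm(xs, layers):
    """Stacked variant: each layer consumes the previous layer's output sequence."""
    for params in layers:
        xs = lstm_layer(xs, *params)
    return xs

# Toy shapes (scaled down from Table 2): 3 stacked layers, batch size 1.
rng = np.random.default_rng(1)
feat, hid, seq, batch, n_layers = 5, 8, 4, 1, 3
layers = []
in_size = feat
for _ in range(n_layers):
    layers.append((rng.standard_normal((4 * hid, in_size)) * 0.1,
                   rng.standard_normal((4 * hid, hid)) * 0.1,
                   np.zeros(4 * hid)))
    in_size = hid                             # deeper layers take hidden-size input
xs = rng.standard_normal((seq, batch, feat))
out = stacked_lstm(xs, layers)
```

Unrolling this sequential loop once in the frontend, instead of re-emitting the cell body per frontend, is what the unification makes possible.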
   
   @masahi @jwfromm please review
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

