ibsidorenko opened a new pull request, #14408:
URL: https://github.com/apache/tvm/pull/14408

   **Motivation:**
   It was found that for standalone elementwise operations (add, sub, etc.)
MetaScheduler generates code with poor performance on some input tensor shapes
due to the lack of vectorized code. The current implementation is not able to
vectorize when the innermost loop's extent is not a multiple of the vector
length.
   
   **What was done:**
   Core changes: the current loop nest is checked, and if all loops are
"simple", i.e. loops without annotations, bindings, or reduce axes, then the
following is done:
    1) Fuse all loops into a single one.
    2) Split this new loop into 2 parts: inner and outer. The split factor
for the inner loop is equal to the '_max_vectorize_extent_' MetaScheduler parameter.
    3) Parallelize the outer loop and vectorize the inner loop.
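   The three steps above can be sketched as a pure-Python analogue of the loop
transformation (this is an illustrative sketch, not the actual TIR schedule
code from this PR; the function name and the `max_vectorize_extent` argument
are chosen here to mirror the parameter mentioned above):

   ```python
   def fused_split_elementwise_add(a, b, max_vectorize_extent=32):
       """Elementwise add over an already-fused (flattened) iteration space.

       Mirrors the restructuring described above: all loops are fused into one,
       then that loop is split by `max_vectorize_extent` into an outer part
       (parallelized in the real schedule) and an inner part (vectorized in the
       real schedule). The tail is predicated, so the fused extent does not
       need to be a multiple of the split factor.
       """
       total = len(a)  # extent of the fused loop
       out = [0] * total
       # Outer extent is rounded up; the predicate below guards the tail.
       outer_extent = (total + max_vectorize_extent - 1) // max_vectorize_extent
       for outer in range(outer_extent):              # parallel loop
           for inner in range(max_vectorize_extent):  # vector loop
               i = outer * max_vectorize_extent + inner
               if i < total:                          # tail predicate
                   out[i] = a[i] + b[i]
       return out
   ```

   For example, an extent of 100 with a split factor of 32 is handled
correctly even though 100 is not a multiple of 32, which is exactly the case
the old implementation failed to vectorize.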
   
   **Performance measurement**:
   Measurements were done on a Qualcomm Snapdragon 888. As expected, rows 1
and 2 show a significant performance boost, while rows 3 and 4 are unchanged.
   
   | N |    op   | Dtype |      Shape       | Before fix, ms | After fix, ms | Speedup |
   |---|---------|-------|------------------|----------------|---------------|---------|
   | 1 | add     | uint8 | 1, 8, 56, 56, 32 |      1.264     |     0.167     |  7.5x   |
   | 2 | qnn.add | uint8 | 1, 8, 56, 56, 32 |      2.213     |     0.336     |  6.6x   |
   | 3 | add     | int32 | 1, 8, 56, 56, 32 |      0.161     |     0.150     |  1.07x  |
   | 4 | seq*    | uint8 | 1, 64, 56, 56    |      2.634     |     2.679     |  0.98x  |
   
   seq* - test of the op sequence qnn.conv2d + bias_add + qnn.requantize,
          weights shape = [256, 64, 1, 1]

