ibsidorenko opened a new pull request, #12714:
URL: https://github.com/apache/tvm/pull/12714

   **Motivation**:
   In quantized models, the nn.pad operation typically is not fused with QNN 
ops and remains a standalone operation. In that case it uses the default 
injective schedule on the Hexagon target, which is not well optimized for some 
shapes/layouts (based on analysis of real models such as ResNet50 INT8).
   
   **What was done:**
   A new schedule for the Pad operation was implemented to replace the default 
injective schedule. On the Hexagon target, the injective schedule fuses all axes 
and vectorizes by 128/64/32 elements (depending on dtype). This works fine for 
Add, Sub, etc., but not for Pad. The new optimized schedule performs these steps 
(fusion + vectorization) only if the last tensor dimension is divisible by 
128/64/32 (depending on dtype). The change applies only to Hexagon; other 
targets (x86, CUDA, etc.) are unchanged and keep the default injective schedule.
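   The divisibility check described above can be sketched as follows. This is a 
simplified, dependency-free illustration of the decision logic, not the actual 
TVM schedule code from this PR; the dtype-to-vector-length mapping is an 
assumption based on Hexagon's 128-byte HVX vectors.

   ```python
   # Hypothetical sketch: decide whether the fused + vectorized schedule
   # is applicable to a pad output. Vector lengths per dtype are assumed
   # from Hexagon's 128-byte HVX vector registers (not taken from the PR).
   VECTOR_LEN = {"uint8": 128, "int8": 128, "float16": 64, "float32": 32}

   def use_fused_vectorized_schedule(shape, dtype):
       """Return True if the last dimension is divisible by the assumed
       vector length for this dtype, so fusing all axes and vectorizing
       does not straddle padded-row boundaries."""
       vlen = VECTOR_LEN.get(dtype)
       if vlen is None:
           return False
       return shape[-1] % vlen == 0
   ```

   For example, (1, 56, 56, 128) with "uint8" qualifies (128 % 128 == 0), 
while (1, 112, 112, 32) with "uint8" does not, so the schedule falls back to a 
different strategy for that shape.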
   
   **Benchmark results on Snapdragon 888:**
   
   4d NHWC layout with ((0, 0), (1, 1), (1, 1), (0, 0)) padding, "uint8" dtype:
   
   | shape              | default schedule, ms | optimized schedule, ms | speedup          |
   |--------------------|----------------------|------------------------|------------------|
   | (1, 112, 112, 32)  | 10.03                | 0.2                    | 50.1x            |
   | (1, 56, 56, 128)   | 0.099                | 0.085                  | ~1x (no speedup) |
   
   
   4d NCHW layout with ((0, 0), (0, 0), (1, 1), (1, 1)) padding, "uint8" dtype:
   
   | shape              | default schedule, ms | optimized schedule, ms | speedup          |
   |--------------------|----------------------|------------------------|------------------|
   | (1, 128, 56, 56)   | 10.96                | 1.38                   | 7.9x             |
   | (1, 32, 126, 126)  | 1.66                 | 1.58                   | ~1x (no speedup) |
   | (1, 32, 128, 128)  | 13.98                | 2.66                   | 5.25x            |
   
   
   5d NCHWc layout with ((0, 0), (0, 0), (1, 1), (1, 1), (0, 0)) padding, 
"uint8" dtype:
   
   | shape              | default schedule, ms | optimized schedule, ms | speedup          |
   |--------------------|----------------------|------------------------|------------------|
   | (1, 4, 56, 56, 32) | 6.39                 | 0.29                   | 22x              |
   | (1, 56, 56, 128)   | 0.15                 | 0.15                   | ~1x (no speedup) |
   
   
   **Summary:**
   For some input tensors we get up to a 50x speedup; for the others, 
performance is unchanged.
   No performance degradations were detected.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
