FrozenGene commented on issue #4277: [ARM][Topi] Improving Int8 Perf in Spatial 
Conv2D schedule.
URL: https://github.com/apache/incubator-tvm/pull/4277#issuecomment-551359926
 
 
   > @jackwish I'd be very interested in those results. I got some good results 
for NHWC on ARMv7 by porting the QNNPACK kernels over
   > and tensorizing 
(https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166),
 and it'd be awesome to see how you folks have approached this problem.
   
   @ajtulloch Thanks for your interest, and for the great discussions we have had. :-) 
   
   I want to summarize our high-level ideas here; we will present the results at 
the next TVM meetup in Shanghai.
   
   For Convolution:
   1. We use NHWC layout
   2. Currently, we use Tensorize.
   
   We studied QNNPACK, but we cannot use it directly: some concepts in QNNPACK, 
for example the indirect buffer, are ones we cannot simulate. So we wrote the 
kernel ourselves.
   
   For Depthwise Convolution
   1. We use NHWC layout
   2. We don't use Tensorize.
   
   Yes. We use the INT16 * INT16 + INT32 -> INT32 instruction (SMLAL), which is 
better than INT32 * INT32 + INT32 -> INT32. To do this, we subtract the 
input_zero_point / kernel_zero_point before the computation; at that point, we 
cast the dtype from UINT8 -> INT16.
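
   As a rough NumPy sketch of the arithmetic described above (not our actual 
kernel): the function name `quantized_dot` and the sample values are made up for 
illustration. It widens UINT8 to INT16, subtracts the zero points, and 
accumulates the INT16 products in INT32, which is what SMLAL does per lane.

```python
import numpy as np

def quantized_dot(data_u8, kernel_u8, input_zero_point, kernel_zero_point):
    """Zero-point-corrected dot product with int32 accumulation (illustrative)."""
    # Widen to int16 first so the difference (range [-255, 255]) cannot overflow.
    d = data_u8.astype(np.int16) - np.int16(input_zero_point)
    k = kernel_u8.astype(np.int16) - np.int16(kernel_zero_point)
    # Emulate a widening multiply-accumulate: int16 * int16 summed in int32.
    return np.sum(d.astype(np.int32) * k.astype(np.int32), dtype=np.int32)

# Hypothetical sample values, chosen only to exercise the range.
data = np.array([0, 128, 255, 17], dtype=np.uint8)
kernel = np.array([255, 0, 130, 64], dtype=np.uint8)
acc = quantized_dot(data, kernel, input_zero_point=128, kernel_zero_point=127)
```

   Subtracting before the multiply keeps both operands within INT16, so the 
cheaper widening multiply can be used instead of a full INT32 multiply.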
   
   For Depthwise convolution, even though we don't use Tensorize, we still get 
better performance than QNNPACK (in MobileNet V1 / MobileNet V2, only 2 layers 
are slower than QNNPACK; on the others we are faster). Amazing result. I want 
to list two key points:
   
   1. Avoid data packing. In im2col / spatial pack, we pack the data on H / W, 
which is costly for depthwise convolution. You can compute it directly and just 
split C, like this:
   ```
       kvshape = (C // VC, M, KH, KW, VC)
       oshape = (N, OH, OW, C)
       dvshape = (N, OH, OW, C // VC, KH, KW, VC)
   ```
   2. compute_at is very important in depthwise convolution. `data_pad_inline` / 
`data_vec_inline` / `conv_inline` should be tunable; this is one important 
factor in outperforming QNNPACK. 
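
   To illustrate point 1, here is a minimal NumPy sketch of a depthwise 
convolution in NHWC that reads the input directly and only splits C into blocks 
of size VC. It is not our TVM schedule. It assumes stride 1, no padding and a 
channel multiplier M of 1 (so the M axis of `kvshape` is dropped); the function 
name and the `vc` parameter are hypothetical.

```python
import numpy as np

def depthwise_conv2d_nhwc_split_c(data, kernel, vc):
    """Depthwise conv, NHWC, stride 1, no padding, channel multiplier 1.

    data:   (N, H, W, C) int16, zero point already subtracted
    kernel: (KH, KW, C)  int16, zero point already subtracted
    vc:     hypothetical channel block size; must divide C
    """
    n, h, w, c = data.shape
    kh, kw, _ = kernel.shape
    assert c % vc == 0, "vc must divide the channel count"
    oh, ow = h - kh + 1, w - kw + 1
    # Kernel laid out as (C // VC, KH, KW, VC); only C is split.
    kvec = kernel.reshape(kh, kw, c // vc, vc).transpose(2, 0, 1, 3)
    out = np.zeros((n, oh, ow, c), dtype=np.int32)
    for co in range(c // vc):              # outer channel block
        cs = slice(co * vc, (co + 1) * vc)
        for i in range(kh):
            for j in range(kw):
                # Read the shifted input window directly; no H / W packing.
                window = data[:, i:i + oh, j:j + ow, cs].astype(np.int32)
                out[..., cs] += window * kvec[co, i, j].astype(np.int32)
    return out

rng = np.random.default_rng(0)
data = rng.integers(-128, 128, size=(1, 6, 6, 8)).astype(np.int16)
kernel = rng.integers(-128, 128, size=(3, 3, 8)).astype(np.int16)
out = depthwise_conv2d_nhwc_split_c(data, kernel, vc=4)
```

   The innermost VC axis is the one that would map to vector lanes; the H / W 
axes are only sliced, never repacked.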
   
   Currently, we have tested MobilenetV2 on a Raspberry Pi, where we reach 1.34X 
the performance of QNNPACK. On our in-house model, the advantage over QNNPACK 
is even larger. We will present more at the TVM meetup.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
