sergey-grovety opened a new pull request #9233:
URL: https://github.com/apache/tvm/pull/9233


   # TVM operator implementations using Cortex-M7 SIMD instructions
   
   ### nn.conv2d
   - Added an implementation of the gemm function for 16-bit input.
   - Removed the restriction that shape[-1] must be a multiple of 4.
   - The 8-bit intrinsic needs a data preparation step that costs too much time for small tensors, so we added a size check and a simple loop to handle that case.
   - As an optimization, calculations were moved from inside the intrinsic to the code that calls it.
   - One buffer that was not in use was removed, which reduces memory requirements by almost half.
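The size check and the lifted multiple-of-4 restriction can be pictured with a small pure-Python model. This is only a sketch of the control flow, not the code from this PR: the cutoff `SMALL_LEN` and all function names are hypothetical, and the blocked path stands in for the intrinsic body without emulating it.

```python
SMALL_LEN = 16  # hypothetical cutoff; the PR does not state the real threshold


def dot_simple(a, b):
    """Plain scalar loop: no data preparation, cheapest for short rows."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_blocked(a, b):
    """Blocks of 4 model the SIMD body; leftover elements go to a scalar tail."""
    n = len(a)
    full = n - n % 4  # largest multiple of 4 that fits
    acc = 0
    for i in range(0, full, 4):
        # Stands in for the prepared operands that would be fed to the intrinsic.
        acc += sum(a[i + k] * b[i + k] for k in range(4))
    for i in range(full, n):
        # Tail: lets the length be something other than a multiple of 4.
        acc += a[i] * b[i]
    return acc


def dot_int8(a, b):
    """Pick the cheap loop for short inputs, the blocked path otherwise."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    if len(a) < SMALL_LEN:
        return dot_simple(a, b)
    return dot_blocked(a, b)
```

Both paths return the same value for any length; only the amount of setup work differs, which is the point of the check.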
   
   ### nn.max_pool2d
   - Implemented with the __SSUB8 and __SEL intrinsics, which process four 8-bit input values at a time and give a notable speedup.
   - Feature: the implementation supports input data that is not word-aligned.
   - Feature: the implementation supports data sizes that are not a multiple of 4.
   - memset is used to initialize the minimum values, for maximum speed.
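To show why __SSUB8 followed by __SEL yields a per-byte maximum, here is a pure-Python model of the two intrinsics. It is a sketch of their documented semantics, not the code from this PR: the helper names are hypothetical, and the `0x80808080` start value assumes the memset fills the buffer with byte `0x80` (the int8 minimum, -128).

```python
def lanes(word):
    """Split a 32-bit word into four signed 8-bit lanes, lowest byte first."""
    out = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)
    return out


def pack(values):
    """Pack four signed 8-bit values into one 32-bit word, lowest byte first."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def ssub8_ge(a, b):
    """Model of the GE flags set by __SSUB8(a, b): lane i is set when a_i - b_i >= 0."""
    return [x - y >= 0 for x, y in zip(lanes(a), lanes(b))]


def sel(a, b, ge):
    """Model of __SEL(a, b): take the byte from a where GE is set, otherwise from b."""
    return pack([x if g else y for x, y, g in zip(lanes(a), lanes(b), ge)])


def max4_int8(a, b):
    """Per-lane maximum of two packed words: subtract to set GE, then select."""
    return sel(a, b, ssub8_ge(a, b))


def pool_max(words):
    """Running maximum over packed words, starting from four lanes of -128."""
    acc = 0x80808080  # assumed memset fill value: every lane is the int8 minimum
    for w in words:
        acc = max4_int8(acc, w)
    return lanes(acc)
```

For example, `pool_max([pack([1, -5, 127, -128]), pack([0, -3, -1, 5])])` gives `[1, -3, 127, 5]`.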
   
   ### nn.avg_pool2d
   - There is no intrinsic that sums four 8-bit values, so the implementation is only possible for 16-bit data.
   - The __SMLAD intrinsic is used to process two 16-bit values at a time.
   - Feature: the implementation supports input data that is not word-aligned.
   - Feature: the implementation supports data sizes that are not a multiple of 4.
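One way to use __SMLAD for a window sum is to multiply each pair of 16-bit values by a packed pair of ones, so that the dual multiply-accumulate degenerates into a two-element add. The pure-Python model below illustrates that idea under this assumption; it is not the code from this PR, and the helper names are hypothetical. Dividing the sum by the window size is left to the caller.

```python
def s16(v):
    """Interpret the low 16 bits of v as a signed halfword."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def pack16(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word."""
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)


def smlad(x, y, acc):
    """Model of __SMLAD(x, y, acc): acc + x.lo * y.lo + x.hi * y.hi, wrapped to int32."""
    total = acc + s16(x) * s16(y) + s16(x >> 16) * s16(y >> 16)
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total & 0x80000000 else total


ONES = pack16(1, 1)  # multiplying by (1, 1) turns the dual MAC into a pairwise sum


def window_sum(values):
    """Sum 16-bit values two at a time, with a scalar step for an odd leftover."""
    acc = 0
    n = len(values)
    for i in range(0, n - n % 2, 2):
        acc = smlad(pack16(values[i], values[i + 1]), ONES, acc)
    if n % 2:
        acc += values[-1]  # tail element when the count is odd
    return acc
```

For example, `window_sum([100, -200, 300])` returns `200`.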
   
   ### nn.dense
   Implemented with the same gemm method described above.
   
   ### nn.conv1d
   A special case of gemm usage, with one of the data dimensions equal to 1.
   
   ### nn.avg_pool1d
   Implemented for the NCW layout with the same intrinsic as the 2D version of the operation.
   
   ### nn.max_pool1d
   Implemented with the same intrinsics as the 2D version of the operation.
   
   
   # Benchmarking
   HW platform: STM32F746 Nucleo; GCC 10, optimization flags: -O3
   To enable intrinsic code generation, specify the -march=armv7e-m parameter of the target:
   `target_str = f"c -keys=arm_cpu -mcpu=cortex-m7 -march=armv7e-m -model=stm32f746xx -runtime=c -link-params=1 --executor=aot --unpacked-api=1 --interface-api=c"`
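The two benchmark configurations differ only in the -march flag, so the target string can be built from one flag list. The helper below is a hypothetical convenience, not part of TVM or of this PR; it only assembles the flags quoted above, and which flags are accepted depends on the TVM version in use.

```python
# Flags copied from the target string above, minus -march.
BASE_FLAGS = [
    "c",
    "-keys=arm_cpu",
    "-mcpu=cortex-m7",
    "-model=stm32f746xx",
    "-runtime=c",
    "-link-params=1",
    "--executor=aot",
    "--unpacked-api=1",
    "--interface-api=c",
]


def make_target_str(use_intrinsics):
    """Return the target string with or without -march=armv7e-m (hypothetical helper)."""
    flags = list(BASE_FLAGS)
    if use_intrinsics:
        # Insert right after -mcpu to match the order used in the description.
        flags.insert(3, "-march=armv7e-m")
    return " ".join(flags)
```

`make_target_str(True)` reproduces the string shown above, and `make_target_str(False)` gives the baseline configuration without intrinsic code generation.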
   
   ## Results
   
   Model | No intrinsic (ms) | Intrinsic enabled (ms)
   -- | -- | --
   mnist8 | 8.625 | 6.574
   cifar10 | 788.36 | 144.59
   
   
   - No intrinsic: -march parameter not specified, so no intrinsic code is generated
   - Intrinsic enabled: -march=armv7e-m
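The relative gain follows directly from the table. A short calculation, using only the reported timings (the dictionary and function names are illustrative):

```python
# Timings in milliseconds, copied from the results table: (no intrinsic, intrinsic enabled).
results_ms = {
    "mnist8": (8.625, 6.574),
    "cifar10": (788.36, 144.59),
}


def speedup(baseline_ms, intrinsic_ms):
    """Ratio of baseline time to intrinsic-enabled time."""
    return baseline_ms / intrinsic_ms


for name, (baseline, intrinsic) in results_ms.items():
    print(f"{name}: {speedup(baseline, intrinsic):.2f}x")
```

This prints roughly 1.31x for mnist8 and 5.45x for cifar10.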


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

