Anndrey24 opened a new pull request, #16106:
URL: https://github.com/apache/tvm/pull/16106

   Implemented an `arm_cpu` conv2d NHWC schedule for fp32 using a hybrid GeMM 
approach, effectively breaking down the matrix multiplication into a 
macro-kernel (partitioning into fixed-sized, tile-level subproblems) and a 
micro-kernel (independently dealing with each subproblem). After the im2col 
transformation, the input matrix is handled natively (not interleaved), while 
the weights matrix is tiled and interleaved at compile time.  
   The micro-kernel accumulates each 4x16 output tile in 16 registers and 
cycles the input and weight operands needed to compute that tile through the 
remaining registers.
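
The macro-kernel / micro-kernel split described above can be sketched in plain Python. This is only an illustration of the tiling idea, not the schedule added in this PR: the function names `hybrid_gemm` and `micro_kernel` are hypothetical, the matrices are nested lists, and the `acc` list merely stands in for the accumulator registers.

```python
TILE_M, TILE_N = 4, 16  # output tile shape taken from the description above


def micro_kernel(A, B, C, m0, n0, K):
    """Compute one TILE_M x TILE_N output tile, clipped at the matrix edges."""
    m_end = min(m0 + TILE_M, len(C))
    n_end = min(n0 + TILE_N, len(C[0]))
    # Accumulators stand in for the registers that hold the output tile.
    acc = [[0.0] * (n_end - n0) for _ in range(m_end - m0)]
    for k in range(K):
        b_row = B[k][n0:n_end]  # weight operands for this step of the reduction
        for i in range(m_end - m0):
            a = A[m0 + i][k]  # input operand, read directly (not interleaved)
            for j, b in enumerate(b_row):
                acc[i][j] += a * b
    # Write the finished tile back into the output matrix.
    for i, row in enumerate(acc):
        C[m0 + i][n0:n_end] = row


def hybrid_gemm(A, B):
    """Macro-kernel: split C = A @ B into tiles and hand each to the micro-kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, TILE_M):
        for n0 in range(0, N, TILE_N):
            micro_kernel(A, B, C, m0, n0, K)
    return C
```

Each tile is computed independently of the others, which is what allows the macro-kernel loops to be reordered or parallelised without touching the micro-kernel.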
   
   There are now two ways to transform the weights matrix for conv2d, which are 
detailed in `convolution.cc`:
   * for fp32: tile, interleave
   * for int8: tile, interleave, transpose
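
As a rough illustration of the two transformations listed above, the sketch below rearranges a flattened `K x N` weights matrix into blocks. The exact layout is documented in `convolution.cc`; the block ordering used here (`[N / tile_N][K / tile_K]` blocks of `tile_K x tile_N`, transposed to `tile_N x tile_K` for the int8 variant) is an assumption, and `tile_interleave` is a hypothetical helper that skips padding.

```python
def tile_interleave(B, tile_K, tile_N, transpose=False):
    """Rearrange a K x N matrix into [N/tile_N][K/tile_K] blocks.

    Each block is tile_K x tile_N, or tile_N x tile_K when transpose=True.
    Assumes K and N are already multiples of the tile sizes (no padding here).
    """
    K, N = len(B), len(B[0])
    assert K % tile_K == 0 and N % tile_N == 0
    out = []
    for n0 in range(0, N, tile_N):
        row_of_blocks = []
        for k0 in range(0, K, tile_K):
            # Copy one tile_K x tile_N block out of the flattened matrix.
            block = [[B[k0 + k][n0 + n] for n in range(tile_N)]
                     for k in range(tile_K)]
            if transpose:
                # Assumed int8-style variant: swap the two inner axes.
                block = [list(col) for col in zip(*block)]
            row_of_blocks.append(block)
        out.append(row_of_blocks)
    return out


# Usage: a 4 x 4 matrix with B[k][n] = 10 * k + n, split into 2 x 2 tiles.
B = [[10 * k + n for n in range(4)] for k in range(4)]
fp32_like = tile_interleave(B, tile_K=2, tile_N=2)
int8_like = tile_interleave(B, tile_K=2, tile_N=2, transpose=True)
```

Because the weights are constant, a rearrangement like this can be done once ahead of time rather than on every inference.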
   
   To keep naming consistent across both implementations (transposed and 
non-transposed), all mentions of `tile_rows_B` and `tile_cols_B` have been 
renamed to `tile_N` and `tile_K` respectively, denoting the tiling size along 
each axis of the flattened B matrix. As usual, `N = out_channels` 
and `K = kernel_width * kernel_height * in_channels`.
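
To make the `N` and `K` definitions concrete, here is a minimal sketch of flattening a weights tensor into the `K x N` matrix B. It assumes the weights are indexed as `W[kh][kw][ic][oc]` (an HWIO-style layout, which is an assumption here), and `flatten_hwio` is a hypothetical helper name.

```python
def flatten_hwio(W):
    """Flatten weights indexed W[kh][kw][ic][oc] into a K x N matrix.

    K = kernel_height * kernel_width * in_channels, N = out_channels.
    Row k of the result corresponds to (h, w, c) with
    k = (h * kernel_width + w) * in_channels + c.
    """
    kernel_height = len(W)
    kernel_width = len(W[0])
    in_channels = len(W[0][0])
    return [
        list(W[h][w][c])
        for h in range(kernel_height)
        for w in range(kernel_width)
        for c in range(in_channels)
    ]


# Usage: a 2 x 3 kernel with 2 input channels and 4 output channels.
W = [[[[1000 * h + 100 * w + 10 * c + o for o in range(4)]
       for c in range(2)]
      for w in range(3)]
     for h in range(2)]
B = flatten_hwio(W)
K, N = len(B), len(B[0])  # K = 2 * 3 * 2, N = 4
```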
   
   I have also added a new conv2d NHWC fp32 test for both the 
`conv2d_nhwc_spatial_pack` and `conv2d_NHWC_fp32_hybrid` schedules.  
   
   cc @ekalda @lhutton1 @neildhickey @leandron

