FrozenGene commented on pull request #5485:
URL: https://github.com/apache/incubator-tvm/pull/5485#issuecomment-621791710


   For performance, have you tried any other layouts? I have some experience with this on CPU. 
A more suitable layout on CPU for NHWC input is:
   
   ```
     input_tile: alpha, alpha, P, CI
     data_pack: alpha, alpha, P, CI
     bgemm: alpha, alpha, P, CO
     inverse: m, m, P, CO
     output: N H W CO
     kernel: alpha alpha CO CI
   ```
   For the kernel, I chose `alpha alpha CO CI` because I want to vectorize over CI. 
On GPU, `alpha alpha CI CO` may be better.
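
To make the shapes concrete, here is a rough NumPy sketch of the layout above. It is only an illustration, not the TVM schedule: the tile size `m = 4`, the kernel size `r = 3`, the helper name `winograd_shapes`, and the assumption that the output keeps the input's H and W (stride 1, "same" padding) are all mine and are not stated in this comment.

```python
import numpy as np

def winograd_shapes(N, H, W, CI, CO, m=4, r=3):
    """Return the tensor shapes of each Winograd stage for the proposed layout.

    m (output tile size) and r (kernel size) are assumed values for illustration.
    """
    alpha = m + r - 1                 # transformed tile size
    n_h = (H + m - 1) // m            # tiles along H (ceil division)
    n_w = (W + m - 1) // m            # tiles along W (ceil division)
    P = N * n_h * n_w                 # total number of tiles
    return {
        "input_tile": (alpha, alpha, P, CI),
        "data_pack": (alpha, alpha, P, CI),
        "kernel": (alpha, alpha, CO, CI),
        "bgemm": (alpha, alpha, P, CO),
        "inverse": (m, m, P, CO),
        "output": (N, H, W, CO),
    }

# Workload (N, H, W, CI, CO) = (1, 56, 56, 64, 64)
shapes = winograd_shapes(1, 56, 56, 64, 64)

rng = np.random.default_rng(0)
data_pack = rng.standard_normal(shapes["data_pack"], dtype=np.float32)
kernel = rng.standard_normal(shapes["kernel"], dtype=np.float32)

# Batched GEMM: for each (alpha, alpha) position, contract over CI.
# CI is the innermost axis of both operands, which is the axis the
# proposed CPU layout aims to vectorize.
bgemm = np.einsum("abpc,aboc->abpo", data_pack, kernel)

for name, shape in shapes.items():
    print(f"{name:>10}: {shape}")
```

With `alpha alpha CO CI`, each dot product in the batched GEMM reads contiguous CI elements from both `data_pack` and `kernel`, which is what makes vectorizing over CI straightforward on CPU.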
   
   I tested your layout against the one I described above. On skylake-512, 
with 20 threads and workload (1, 56, 56, 64, 64), your layout takes 0.388 ms 
and mine takes 0.375 ms. The result is stable and reproducible.



