FrozenGene edited a comment on pull request #5485: URL: https://github.com/apache/incubator-tvm/pull/5485#issuecomment-621791710
For performance, have you tried some other layouts on GPU? I have some experience on CPU. For NHWC input, the more suitable layout on CPU is:

```
input_tile: alpha, alpha, P, CI
data_pack: alpha, alpha, P, CI
bgemm: alpha, alpha, P, CO
inverse: m, m, P, CO
output: N H W CO
kernel: alpha alpha CO CI
```

For the kernel, I chose `alpha alpha CO CI` because I want to vectorize over CI. On GPU, `alpha alpha CI CO` may be better.

I tested your layout against the one I mentioned: on skylake-512, your layout takes 0.388 ms and mine takes 0.375 ms. I used 20 threads on workload (1, 56, 56, 64, 64). The results are stable and reproducible.
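To make the layout concrete, here is a minimal sketch that computes the shape and row-major strides of each stage for the workload above. Several things are assumptions rather than facts from the comment: the workload tuple is read as `(N, H, W, CI, CO)`, the output tile size `m = 4` and kernel size `r = 3` are picked for illustration, and stride 1 with "same" padding is assumed so the output stays `H x W`. The helper names `winograd_nhwc_shapes` and `row_major_strides` are hypothetical and not part of TVM.

```python
from math import ceil


def winograd_nhwc_shapes(N, H, W, CI, CO, m=4, r=3):
    """Shapes of each stage for the layout listed in the comment.

    m (output tile size) and r (kernel size) are assumptions; the
    comment does not state them. Stride 1 and 'same' padding are also
    assumed, so the output is H x W.
    """
    alpha = m + r - 1                  # edge length of one input tile
    P = N * ceil(H / m) * ceil(W / m)  # total number of tiles
    return {
        "input_tile": (alpha, alpha, P, CI),
        "data_pack": (alpha, alpha, P, CI),
        "kernel": (alpha, alpha, CO, CI),
        "bgemm": (alpha, alpha, P, CO),
        "inverse": (m, m, P, CO),
        "output": (N, H, W, CO),
    }


def row_major_strides(shape):
    """Element strides of a dense row-major tensor with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


# Workload (1, 56, 56, 64, 64), read here as (N, H, W, CI, CO).
shapes = winograd_nhwc_shapes(1, 56, 56, 64, 64)
for name, shape in shapes.items():
    print(f"{name:10s} shape={shape} strides={row_major_strides(shape)}")
```

Under these assumptions, CI is the innermost axis of both `data_pack` and `kernel`, so it has stride 1 in both operands and the bgemm reduction over CI reads contiguous memory on each side. With a kernel layout of `alpha alpha CI CO`, CO would be the contiguous axis of the kernel instead.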