DickJC123 edited a comment on issue #15167: Pointwise fusion for GPU
URL: https://github.com/apache/incubator-mxnet/pull/15167#issuecomment-551332574

After some investigation, I have an explanation and a planned fix for the perf regression. To repeat what @ptrendx mentioned, the runtime compilation of fused kernels costs additional time up front, with the idea that over many kernel invocations the compile time is more than made up for by the increased efficiency of the fused op. This matches the typical use case (unlike CI), so I believe fusion should be left enabled by default.

The one saving grace for each of the 3 tests mentioned by @rondogency is that most of the fused ops in a test are duplicates of others seen earlier in the same test. In fact, <2% of the created fused ops are unique. To fix this, I will submit a PR introducing a 'fused op cache' that maps (source-code, gpu_arch) -> runnable kernel. This should eliminate most of the runtime compilations and correct, in large part, the issue flagged here.
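
To make the cache idea concrete, here is a minimal sketch in Python (chosen for brevity; the real change would live in the C++ fused-op code). All names here (`FusedOpCache`, `compile_fn`, `fake_compile`) are hypothetical and are not MXNet APIs, and the stand-in compiler only counts calls instead of invoking a real runtime compiler.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class FusedOpCache:
    """Maps (source_code, gpu_arch) -> compiled kernel (hypothetical sketch)."""

    def __init__(self, compile_fn: Callable[[str, int], Any]) -> None:
        # compile_fn stands in for the expensive runtime compilation step.
        self._compile_fn = compile_fn
        self._kernels: Dict[Tuple[str, int], Any] = {}
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def get(self, source: str, gpu_arch: int) -> Any:
        key = (source, gpu_arch)
        # Holding the lock during compilation keeps the sketch simple and
        # guarantees each unique key is compiled at most once.
        with self._lock:
            if key in self._kernels:
                self.hits += 1
                return self._kernels[key]
            self.misses += 1
            kernel = self._compile_fn(source, gpu_arch)
            self._kernels[key] = kernel
            return kernel


def fake_compile(source: str, gpu_arch: int) -> Tuple[str, int]:
    # Placeholder "kernel": a real implementation would return a loaded,
    # launchable function for the given architecture.
    return ("kernel", hash((source, gpu_arch)))


if __name__ == "__main__":
    cache = FusedOpCache(fake_compile)
    # 100 requests, but only 2 distinct kernel sources on one architecture.
    for i in range(100):
        cache.get(f"out = a + b * {i % 2};", 70)
    print(f"compilations: {cache.misses}, cache hits: {cache.hits}")
```

The key includes `gpu_arch` because the same source compiled for a different architecture yields a different binary. With mostly duplicate fused ops, as in the tests discussed above, nearly every request after the first few becomes a dictionary lookup.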
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
