Hello Community,

Besides compute-intensive operations like convolutions and fully connected layers, DL models contain many simple pointwise (aka elementwise) operations, such as elementwise addition. The performance of those operations is fully memory-bandwidth bound, which limits the speedups obtainable from newer GPU hardware, which typically has a high compute-to-memory-bandwidth ratio. There are multiple ongoing efforts (e.g. TVM) to use compiler technology to deal with this and other, harder performance problems. However, integrating e.g. TVM into MXNet is a long-term effort, and in the meantime there is a need for a simpler, more focused approach to this problem.
This proposal (design doc [1], PR [2]) is intended as a short-term solution to this problem, using the existing NNVM backend in MXNet and without requiring a big refactoring. Any feedback and help will be greatly appreciated.

Thank you,
Przemek

[1] https://cwiki.apache.org/confluence/display/MXNET/GPU+Pointwise+fusion
[2] https://github.com/apache/incubator-mxnet/pull/15167
