+1 for this proposal. Operator fusion is a very common technique to improve 
effective memory bandwidth and reduce latency.

My suggestions:
* Flexibility
Because fusion, especially pointwise fusion, is backend- and 
device-independent, it's better to make the solution flexible rather than 
limiting it to the GPU backend only.
Each backend/device can then provide its own fusion code, via a toolchain or 
optimized kernels, through the same path.
In the short term, enabling only the GPU kernel is fine, and I will 
contribute the CPU code soon.
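To illustrate why pointwise fusion is backend-independent, here is a minimal, backend-agnostic sketch (pure Python, not MXNet code): two unfused pointwise ops each make a full pass over memory, while the fused version composes them into a single pass. A real backend (GPU toolchain, CPU JIT, or a hand-written kernel) would generate the fused loop body; the function names here are illustrative only.

```python
def scale(v):
    # pointwise op 1: multiply each element by 2
    return 2.0 * v

def add_one(v):
    # pointwise op 2: add 1 to each element
    return v + 1.0

def fuse(*ops):
    """Compose pointwise ops so the data is read and written only once."""
    def fused(xs):
        out = []
        for v in xs:          # single pass over memory
            for op in ops:    # apply all ops while the value is "in registers"
                v = op(v)
            out.append(v)
        return out
    return fused

fused = fuse(scale, add_one)
print(fused([1.0, 2.0]))  # [3.0, 5.0]
```

The same composition idea applies regardless of whether the fused loop body is emitted as CUDA, CPU vector code, or anything else, which is why the graph pass itself need not be GPU-specific.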

* Reuse the MXNET_SUBGRAPH_BACKEND env variable
There are already lots of environment variables, so I suggest reusing the 
subgraph env variable, which is already well known and documented.
For example, MXNET_SUBGRAPH_BACKEND=MKLDNN enables CPU fusion today.
https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md
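A usage sketch of what reusing the existing variable could look like from Python (the GPU backend name below is hypothetical; only MKLDNN exists today, and the variable must be set before MXNet is imported since subgraph passes are selected at import time):

```python
import os

# Existing, documented behavior: select the MKLDNN subgraph backend
# for CPU fusion.
os.environ["MXNET_SUBGRAPH_BACKEND"] = "MKLDNN"

# Hypothetical: a GPU pointwise-fusion pass could register under its own
# backend name instead of adding a new MXNET_USE_FUSION variable.
# os.environ["MXNET_SUBGRAPH_BACKEND"] = "GPU_POINTWISE"  # name is illustrative

# import mxnet as mx  # the chosen subgraph backend is picked up here
```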

Questions:
* "Introduce graph passes that look for subgraphs made of compatible 
pointwise ops and replace them with proper _FusedOp nodes."
What are the "compatible pointwise ops", and are they CUDA-version specific 
or hardware-independent?
Do developers need to be aware of whether their new ops are compatible?

* "Fusion is guarded by MXNET_USE_FUSION environment variable. It should be 
decided what the default should be."
Any hints for the user?
Is it possible for the user to switch off some of the fusions, or add more?

Thanks,

BR,

--Patric



> -----Original Message-----
> From: Przemysław Trędak [mailto:ptre...@apache.org]
> Sent: Sunday, June 9, 2019 11:57 AM
> To: d...@mxnet.apache.org
> Subject: Proposal - GPU pointwise fusion
> 
> Hello Community,
> 
> DL models, besides compute intensive operations like convolutions and fully
> connected layers, feature a lot of simple pointwise (aka elementwise)
> operations (like elementwise addition etc.). Performance of those operations
> is fully memory bandwidth bound and so it limits speedups from newer GPU
> hardware, which typically has high compute/memory bandwidth ratio. There
> are multiple attempts (e.g. TVM) ongoing to use compiler technology in order
> to deal with this and other, harder performance problems. However,
> integration of e.g. TVM into MXNet is a long term effort and there is a need
> for a simpler, more focused, approach to deal with this problem in the
> meantime.
> 
> This proposal (design doc [1], PR [2]) attempts to be a short term solution to
> this problem - using existing NNVM backend to MXNet and without a big
> refactoring required.
> 
> Any feedback and help will be greatly appreciated.
> 
> Thank you,
> Przemek
> 
> [1]
> https://cwiki.apache.org/confluence/display/MXNET/GPU+Pointwise+fusion
> [2] https://github.com/apache/incubator-mxnet/pull/15167
