On 2019/06/10 04:59:46, "Zhao, Patric" <[email protected]> wrote:
> +1 for this proposal. Operator fusion is a very common technique to improve
> effective memory bandwidth and reduce latency.
>
> My suggestions:
> * Flexibility
> Since fusion, especially pointwise fusion, is a backend- and
> device-independent concept, it is better to make the solution flexible
> rather than limiting it to the GPU backend. Each backend/device can then
> provide its own fusion code, via a toolchain or optimized kernels, through
> the same path. In the short term, enabling only the GPU kernel is fine,
> and I will contribute the CPU code soon.
Most of the code is independent of the backend (like the handling of
attribute inference and the passes that apply the fusion). The code
generation depends on CUDA and NVRTC, but it should not be hard to make a
CPU version - I have no idea how to do JIT compilation on the CPU though,
so you are welcome to contribute that part.
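For the curious, the GPU path boils down to compiling a generated source
string with NVRTC and loading the resulting PTX through the CUDA driver
API. A minimal sketch (error handling omitted; the source string and
kernel name are placeholders, not what the pass actually generates, and a
CUDA context is assumed to be current):

    #include <cuda.h>
    #include <nvrtc.h>
    #include <string>
    #include <vector>

    // Compile a generated kernel source at runtime and return a callable
    // CUfunction. Real code must check every nvrtc*/cu* return value.
    CUfunction CompileFused(const std::string& src, const char* kernel_name) {
      nvrtcProgram prog;
      nvrtcCreateProgram(&prog, src.c_str(), "fused.cu", 0, nullptr, nullptr);
      const char* opts[] = {"--std=c++11"};
      nvrtcCompileProgram(prog, 1, opts);
      size_t ptx_size;
      nvrtcGetPTXSize(prog, &ptx_size);
      std::vector<char> ptx(ptx_size);
      nvrtcGetPTX(prog, ptx.data());
      nvrtcDestroyProgram(&prog);
      CUmodule mod;
      CUfunction func;
      cuModuleLoadData(&mod, ptx.data());
      cuModuleGetFunction(&func, mod, kernel_name);
      return func;
    }

A CPU backend would replace the NVRTC/driver-API part with whatever JIT
mechanism it has available, while the graph passes stay unchanged.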
>
> * Reuse the MXNET_SUBGRAPH_BACKEND env variable
> There are already lots of env variables, so I suggest reusing the
> subgraph one, which is already well known and documented. For example,
> MXNET_SUBGRAPH_BACKEND=MKLDNN enables CPU fusion today.
> https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md
Will look into it.
>
> Questions:
> * " Introduce graph passes that look for subgraphs made of compatible
> pointwise ops and replace them with proper _FusedOp nodes."
> What are the "compatible pointwise ops", and are they CUDA-version
> dependent or HW independent?
Those are the ops that I have written code for. That code is actually pure
C++11, so as long as you can JIT C++11 you could reuse it.
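To give an idea (illustrative snippets, not the exact functions from the
PR), the per-op bodies are plain C++11 along these lines:

    #include <cmath>

    // Illustrative pointwise op bodies -- plain C++11 with no
    // CUDA-specific constructs, so any backend able to JIT C++11 could
    // reuse them.
    inline float add(float a, float b) { return a + b; }
    inline float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

    // A fused subgraph then becomes a single loop such as:
    //   for (int i = 0; i < n; ++i) out[i] = sigmoid(add(a[i], b[i]));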
> Does the developer need to be aware of whether their new op is compatible?
If somebody writes a new op and does not make the fusion aware of it, then
the op will simply not be fused. It is just a performance optimization,
though.
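As a purely hypothetical sketch of what "making the fusion aware" of an op
could mean (the registry name and the %(0)-style placeholder format are my
inventions, not the PR's actual structures):

    #include <map>
    #include <string>

    // Hypothetical registry mapping an op name to the C++11 expression
    // the code generator emits for it; %(0), %(1) stand for the inputs.
    std::map<std::string, std::string> fused_op_codegen = {
      {"elemwise_add", "add(%(0), %(1))"},
      {"sigmoid",      "sigmoid(%(0))"},
      // a new pointwise op stays unfused until an entry like this exists:
      {"my_new_op",    "my_new_op(%(0))"},
    };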
>
> * " Fusion is guarded by MXNET_USE_FUSION environment variable. It should be
> decided what the default should be."
> Any hints for the user?
What would you suggest?
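For context, the guard itself is just an environment-variable check; a
minimal sketch assuming dmlc::GetEnv, with a placeholder default pending
this discussion:

    #include <dmlc/parameter.h>  // dmlc::GetEnv

    // Run the pointwise-fusion graph pass only when the user opts in;
    // the default value here is a placeholder, not a decision.
    const bool use_fusion = dmlc::GetEnv("MXNET_USE_FUSION", false);
    if (use_fusion) {
      // ... apply the fusion passes to the graph ...
    }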
> Is it possible for the user to switch off some of the fusion, or add more?
Do you mean something like "fuse add and sigmoid only", or doing only the
pointwise fusion vs. some other kind of fusion?
>
> Thanks,
>
> BR,
>
> --Patric
>
>
>
> > -----Original Message-----
> > From: Przemysław Trędak [mailto:[email protected]]
> > Sent: Sunday, June 9, 2019 11:57 AM
> > To: [email protected]
> > Subject: Proposal - GPU pointwise fusion
> >
> > Hello Community,
> >
> > DL models, besides compute-intensive operations like convolutions and
> > fully connected layers, feature a lot of simple pointwise (aka
> > elementwise) operations (like elementwise addition etc.). The
> > performance of those operations is fully memory-bandwidth bound, so
> > they limit the speedups available from newer GPU hardware, which
> > typically has a high compute-to-memory-bandwidth ratio. There are
> > multiple ongoing attempts (e.g. TVM) to use compiler technology to
> > deal with this and other, harder performance problems. However,
> > integrating e.g. TVM into MXNet is a long-term effort, and there is a
> > need for a simpler, more focused approach to deal with this problem in
> > the meantime.
> >
> > This proposal (design doc [1], PR [2]) attempts to be a short-term
> > solution to this problem - using the existing NNVM backend to MXNet
> > and without requiring a big refactoring.
> >
> > Any feedback and help will be greatly appreciated.
> >
> > Thank you,
> > Przemek
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/MXNET/GPU+Pointwise+fusion
> > [2] https://github.com/apache/incubator-mxnet/pull/15167
>