Agreed with Tianqi that we could have a better implementation once we have better TVM NNVM v2 integration. For now, I believe we shouldn't block the development of the Intel folks.
On Tue, Apr 9, 2019 at 10:10 PM Tianqi Chen <tqc...@cs.washington.edu> wrote:
>
> This kind of conversion can be viewed as an enhanced version of the AlterOpLayout pass in TVM Relay.
>
> On Tue, Apr 9, 2019 at 8:03 PM Lv, Tao A <tao.a...@intel.com> wrote:
> >
> > Thank you Tianqi and Sam for the kind suggestions.
> >
> > @Tianqi,
> >
> > Can you please point me to the code of this pass, or do you think someone from the TVM community could help educate me on it? I'm very happy to learn from that.
> >
> > Just one note: we are not only doing layout transformation but also want more memory for the layout transformation. For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16, H=256, W=256) on the channel dimension, and then (N=32, C=16, H=256, W=256) will be converted to nchw16c so we can leverage the corresponding optimal computation kernels. That's why we also need changes to the memory planning pass.
> >
> > @Sam,
> >
> > Yes, we're definitely treating MKL-DNN as an accelerator on CPU. Previously we used it to accelerate certain critical operators in MXNet in certain situations, e.g. FP32 convolution/deconvolution/fullyConnected, etc. But as both MXNet and MKL-DNN have evolved, we started to do more that is not supported by MXNet's original CPU implementation, such as quantization and graph fusion. So the MKL-DNN backend is also changing from a simple `accelerator` into a `default` backend on CPU. And I totally agree with you that we need to think more about the software architecture for maintainability, testability and readability - that's why I sent out this proposal to get more ideas from the community.
> >
> > -tao
> >
> > -----Original Message-----
> > From: Skalicky, Sam [mailto:sska...@amazon.com.INVALID]
> > Sent: Wednesday, April 10, 2019 2:24 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass
> >
> > I agree with Tianqi. We should let MKLDNN participate in memory planning by first having a separate NNVM pass and then using that info in the regular memory planning phase.
> >
> > It's starting to sound like MKLDNN should be treated like an accelerator rather than an operator library, as it has explicit needs and can provide acceleration when given extra capabilities in MXNet, such as input to the memory planning NNVM pass. It also has special tensor formatting needs and conversions that could be better architected than they currently are.
> >
> > We need to think about how we want to architect this for maintainability, testability, and readability.
> >
> > Sam
> >
> > > On Apr 9, 2019, at 11:11 AM, Tianqi Chen <tqc...@cs.washington.edu> wrote:
> > >
> > > The layout transformation should really be a separate optimization pass rather than part of memory planning, as is done in the TVM stack. If we want to do a clean-slate solution, I would recommend looking into that instead.
> > >
> > > Tianqi
> > >
> > > On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A <tao.a...@intel.com> wrote:
> > >
> > >> Hi dev,
> > >>
> > >> As we're discussing the roadmap for MXNet 2.0, I would like to start a thread about refining the InferStorageType and memory planning passes in MXNet, and I hope this can happen as part of the 2.0 release.
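To make the nchw16c padding example in the reply above concrete, here is a minimal, self-contained sketch of the arithmetic involved. It is illustrative only: it does not use MXNet, NNVM, or MKL-DNN APIs, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <iostream>

// Round a channel count up to the nearest multiple of the vector block size
// (16 FP32 lanes for AVX-512), as the blocked nchw16c layout requires.
int64_t PadChannels(int64_t c, int64_t block = 16) {
  return ((c + block - 1) / block) * block;
}

int main() {
  // Example from the thread: (N=32, C=3, H=256, W=256) padded to C=16.
  int64_t N = 32, C = 3, H = 256, W = 256;
  int64_t Cp = PadChannels(C);  // 3 -> 16

  // The buffer sized from the arithmetic shape (what InferShape reports)
  // is smaller than what the blocked nchw16c kernel needs, which is why
  // the memory planning pass has to know about the padding.
  int64_t arithmetic_elems = N * C * H * W;
  int64_t padded_elems     = N * Cp * H * W;

  std::cout << "arithmetic elements: " << arithmetic_elems << "\n"
            << "padded (nchw16c) elements: " << padded_elems << "\n";
  return 0;
}
```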
> > >>
> > >> Thanks to @eric-haibin-lin, part of the proposal has already been discussed in issue #13598 [1].
> > >>
> > >> As mentioned in the description of issue #13598, there are several drawbacks in the existing flow. Please allow me to quote them here:
> > >>   * The selection of the MKL/CPU/GPU/CUDNN implementation happens after graph attribute inference and memory planning. Memory planning is thus not aware of the implementation that will be used for execution later, which may produce a sub-optimal result. For example, the memory in-place option may vary depending on the accelerator backend (the new version of CUDNN enables x/dx in-place for _backward_conv).
> > >>   * Some sparse operators need to access dtype/shape information to decide which implementation to invoke for execution, and whether to perform a fallback. This information is not yet exposed in the existing infer storage type interface.
> > >>
> > >> Besides, the existing memory planning pass calculates and afterwards allocates memory strictly according to the input/output tensor shapes (which can be obtained from operators' arithmetic formulas through InferShape). That is no longer true when we come to accelerators like MKL-DNN on CPU, which want to pad input/output tensors to optimal formats (e.g. nchw16c) according to the hardware architecture; this can also be described as shape + stride. As many of you know, MKL-DNN shows great performance on these optimal formats, which are blocked according to the vector length of AVX-512 or AVX2. It is very natural for us to pad the channel dimension of those inputs/outputs whose IC or OC is not a multiple of the vector length and leverage the optimal kernels for blocked formats. Unfortunately, this cannot be implemented without changing the logic in the memory planning pass. Currently we always fall back to slow reference kernels for both convolution [2] and deconvolution [3].
> > >>
> > >> AFAIK, the padding feature of MKL-DNN has already been used in TensorFlow and other frameworks. We also found that, without supporting this feature, many other new features from MKL-DNN cannot be applied to MXNet, such as the deconvolution primitive, Winograd, etc.
> > >>
> > >> Changes for this proposal can be divided into the following parts:
> > >>   1. Following the proposal in issue #13598, we need to add new InferStorageTypeEx functions to operators which need to dispatch in a more fine-grained way. This also requires the InferStorageType pass to handle the new -Ex functions, as we did for FCompute and FComputeEx.
> > >>   2. Attach more information to the computation graph/node, e.g. accelerator-specific information. Currently we add `IsMKLDNN` directly during operator registration if MXNET_USE_MKLDNN == 1, which looks rather crude to me.
> > >>   3. Do memory planning according to more information: topology, shapes, data types, in-place options and more accurate accelerator information (accelerator path, memory size requirements, accelerator-wise attributes).
> > >>   4. Improve MKL-DNN operators so they can work on well-planned memory, which may be larger than the arithmetic requirements, and use the optimal kernels. Also, with more accurate dispatching in InferStorageTypeEx, there is no need for us to write complicated fallback logic in MKL-DNN operators (a minimal sketch illustrating items 3 and 4 follows after the references below).
> > >>   5. If users feel uncomfortable with the additional memory usage, we can disable this feature via an environment variable.
> > >>
> > >> Since the memory planning pass is implemented in NNVM, we will also need support from the TVM community.
> > >>
> > >> Please let me know what you think. Thank you.
> > >>
> > >> -tao
> > >>
> > >> [1] https://github.com/apache/incubator-mxnet/issues/13598
> > >> [2] https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_convolution.cc#L194
> > >> [3] https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_deconvolution.cc#L55
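As referenced in item 4 of the proposal above, here is a minimal, self-contained sketch of the memory-planning idea in items 3 and 4: each buffer is sized by the larger of its arithmetic requirement and a requirement requested by the accelerator backend. This is a rough sketch under the assumption that an earlier accelerator-specific pass has already attached a per-tensor size request to the graph; it is not the actual NNVM PlanMemory pass, and all types and names here are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Per-tensor memory request: the size implied by the arithmetic shape plus an
// optional, larger size requested by an accelerator backend (e.g. a padded
// nchw16c buffer for MKL-DNN). Hypothetical structure for illustration.
struct TensorRequest {
  int64_t arithmetic_bytes;   // derived from InferShape/InferType
  int64_t accelerator_bytes;  // 0 if the backend has no extra requirement
};

// Size each buffer by the larger of the two requirements, so blocked kernels
// can run on padded memory instead of falling back to reference kernels.
std::vector<int64_t> PlanSizes(const std::vector<TensorRequest>& reqs) {
  std::vector<int64_t> sizes;
  sizes.reserve(reqs.size());
  for (const auto& r : reqs) {
    sizes.push_back(std::max(r.arithmetic_bytes, r.accelerator_bytes));
  }
  return sizes;
}

int main() {
  // First tensor: (N=32, C=3->16, H=256, W=256) FP32, so the padded buffer
  // is larger. Second tensor: C=64 is already a multiple of 16, no padding.
  std::vector<TensorRequest> reqs = {
      {32LL * 3 * 256 * 256 * 4, 32LL * 16 * 256 * 256 * 4},
      {32LL * 64 * 256 * 256 * 4, 0},
  };
  for (int64_t s : PlanSizes(reqs)) std::cout << s << " bytes\n";
  return 0;
}
```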