Great. Let's coordinate to keep our efforts aligned.

On Wed, Jan 31, 2018 at 9:51 PM, Zhao, Patric <[email protected]> wrote:
> Thanks, Jun, please see my comments inline.
>
> Wenting and Jin will follow up on the tasks in the PR.
>
> *From:* Jun Wu [mailto:[email protected]]
> *Sent:* Thursday, February 1, 2018 12:40 PM
> *To:* [email protected]
> *Cc:* Ye, Jason Y <[email protected]>; Lv, Tao A <[email protected]>; Jiang, Wenting <[email protected]>; Zhao, Patric <[email protected]>
> *Subject:* Re: Intel Plan for the contribution to MXNET
>
> Hi Patric,
>
> Thanks for the contribution. It's great to see action on developing INT8 inference for CPU! I have a few questions and hope to have your answers.
>
> 1. When you said your work is aligned with PR 9552 <https://github.com/apache/incubator-mxnet/pull/9552>, did you mean you used the quantization+calibration flow developed in that PR for benchmarking inference?
>
> [Patric] The benchmark accuracy is based on MKL-DNN and Ziheng's old quantization branch.
> We have now merged to master (based on #8302) with the quantization+calibration PR for INT8 development, and we will share the accuracy and performance soon.
>
> 2. In your MNIST benchmark, which operators are quantized?
>
> [Patric] Conv, relu, and flatten are quantized in our MNIST benchmark (conv+relu+flatten+FC+softmax).
> In addition, MKL-DNN supports INT8 pooling, concat, and fused ops (conv with relu/elemwise/bn).
>
> 3. Is the MNIST quantized model calibrated?
>
> [Patric] Not yet. We ran that experiment on Ziheng's old quantization branch and are now moving to the branch of the quantization+calibration PR.
>
> 4. Is the INT8 inference accuracy produced by the *calibrated* quantized model, or by the quantized model without calibration?
>
> [Patric] Without calibration.
>
> 5. What are the inference throughputs of the FP32 model and the INT8 model, respectively?
>
> [Patric] At this stage we are mainly focused on accuracy and the algorithm. Performance tuning is on the way :)
>
> Thanks,
>
> Jun
>
> On Wed, Jan 31, 2018 at 8:08 PM, Zhao, Patric <[email protected]> wrote:
>
> Hi MXNET developers,
>
> We are from the Intel Software and Services Group (SSG) and are working on performance optimization for MXNET on Intel Architecture (IA).
>
> Let me give a brief introduction to our ongoing projects. Any suggestions and comments are highly appreciated.
>
> 1) MKL-DNN integration with the new NNVM interface
>
> Together with Zheng-Da, we have designed a new NNVM-based interface for MKL-DNN.
> The new implementation shows better performance and flexibility than the old MKL engine.
>
> The PR is under review (https://github.com/apache/incubator-mxnet/pull/8302); many thanks for your great comments in the thread :)
> After the PR is merged, we will push more MKL-DNN related features and performance optimizations, such as a fused conv + relu op for inference.
>
> 2) INT8 inference
>
> MKL-DNN also provides INT8 computations for ops such as conv, relu, and pooling, which can improve inference performance significantly with only a slight accuracy drop (< 1%).
> Currently, we have implemented quantization, de-quantization, and some computing ops in a local branch.
> Our latest implementation is aligned with this PR (https://github.com/apache/incubator-mxnet/pull/9552) and passes the unit tests.
>
> For a simple network (conv+relu+flatten+FC+softmax) on the MNIST dataset, we get very similar inference accuracy (FP32: 98.06% vs. INT8: 97.93%).
> We will post a summary of our solution in this PR soon.
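For reference, below is a minimal sketch of the calibration-based quantization flow being discussed, assuming the quantize_model API proposed in PR 9552; the module path, argument names, checkpoint prefix, and data file paths here are illustrative assumptions and may differ from the merged version.

```python
# Sketch only: assumes the quantize_model API proposed in PR 9552;
# the exact module path and argument names may differ after merge.
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Load a trained FP32 MNIST model (hypothetical checkpoint prefix).
sym, arg_params, aux_params = mx.model.load_checkpoint('mnist-convnet', 0)

# A representative sample of validation data used for calibration
# (standard MNIST test files; paths are placeholders).
calib_data = mx.io.MNISTIter(image='t10k-images-idx3-ubyte',
                             label='t10k-labels-idx1-ubyte',
                             batch_size=32)

# Quantize conv/FC layers to INT8 and calibrate activation ranges on the
# sample so the requantization thresholds are fixed offline.
qsym, qarg_params, qaux_params = quantize_model(
    sym=sym, arg_params=arg_params, aux_params=aux_params,
    ctx=mx.cpu(), calib_mode='entropy',
    calib_data=calib_data, num_calib_examples=500)

# Save the calibrated INT8 model for deployment.
mx.model.save_checkpoint('mnist-convnet-int8', 0, qsym, qarg_params, qaux_params)
```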
>
> I hope both CPU and GPU can be compatible and share a common code base, so I think we need more discussion in the PR :)
>
> 3) RNN implementations
>
> Currently, there is no CPU implementation for mx.sym.rnn, and the Python implementation is really slow.
> We are working on resolving this issue from two aspects:
>
> - Provide a C/C++ level implementation, registered via FCompute<cpu> (the GPU code should be moved to NNVM as well).
>
>   We plan to submit a PR for LSTM/GRU in March; our initial results are below, FYI.
>   Size: N = 12, T = 1600, I = 161, H = 1760 (the first layer of DeepSpeech 2)
>
>   Forward time (s)       mx.sym GRU bound to Intel GRU C    Native mx.rnn.GRUCell
>   SKX 6148, 2 sockets    1.32                               72.7
>
> - Provide the MKL-DNN RNN interface (under development, https://github.com/intel/mkl-dnn/issues/46), registered via FComputeEx<cpu>.
>
>   The higher-performance RNN is under development by the MKL-DNN team, and we will merge it when it is ready.
>   I think CPU users can get a further performance boost from the MKL-DNN library.
>
> Thanks in advance!
>
> BR,
>
> -- Patric
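To make the comparison in the table above concrete, here is a minimal sketch of how the two GRU paths are typically constructed in MXNet's symbolic API (the fused mx.sym.RNN operator vs. an unrolled mx.rnn.GRUCell). The shapes follow the DeepSpeech 2 first-layer setting from the table; the variable names and graph construction details are illustrative assumptions, not the benchmark code itself.

```python
# Sketch of the two GRU paths compared above; symbol construction only.
# Shapes follow N=12, T=1600, I=161, H=1760 (DeepSpeech 2 first layer).
import mxnet as mx

N, T, I, H = 12, 1600, 161, 1760
data = mx.sym.Variable('data')  # layout TNC: (T, N, I)

# Path 1: the fused RNN operator, which can dispatch to a native C/C++
# kernel once one is registered via FCompute<cpu>/FComputeEx<cpu>.
fused_gru = mx.sym.RNN(data=data,
                       parameters=mx.sym.Variable('gru_params'),
                       state=mx.sym.Variable('gru_init_state'),
                       state_size=H, num_layers=1, mode='gru',
                       name='fused_gru')

# Path 2: the unrolled Python GRUCell, which builds T separate time steps
# in the graph and is the slow CPU baseline in the table.
cell = mx.rnn.GRUCell(num_hidden=H, prefix='gru_')
outputs, states = cell.unroll(length=T, inputs=data,
                              layout='TNC', merge_outputs=True)
```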
