kpuatamazon opened a new pull request #17559: [MXNET-1446] Quantization: 
intgemm matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559
 
 
   ## Description ##
   This pull request adds wrappers to the intgemm matrix multiplication 
library: https://github.com/kpu/intgemm .  
   
   A performance comparison with DNNL aka MKL-DNN is at 
https://github.com/kpu/intgemm/issues/59
   
   The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf .  The purpose of this pull request is to enable similar functionality in Sockeye: https://github.com/awslabs/sockeye/pull/771 . 
   
   Quantized Sockeye runs 2.95x as fast.  One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.
   
   intgemm uses a custom memory layout for the weight matrix to make more 
memory accesses consecutive, so there are operators to convert weights to that 
format.  The idea is that weights are typically loaded once for inference. 
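   The packing idea can be illustrated with NumPy.  intgemm's actual layout is internal to the library; the 8-column tile shape below is an assumption for illustration only:

```python
import numpy as np

# Illustration only: intgemm's real weight layout is internal to the
# library; this just shows the general pre-packing idea.  Rearrange a
# row-major (K, N) weight matrix into 8-column tiles so the GEMM inner
# loop reads each tile from consecutive memory.
def prepack_weight(w, tile_cols=8):
    k, n = w.shape
    assert n % tile_cols == 0
    tiles = w.reshape(k, n // tile_cols, tile_cols).transpose(1, 0, 2)
    return np.ascontiguousarray(tiles)  # shape (n_tiles, K, tile_cols)

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(64, 16), dtype=np.int8)
packed = prepack_weight(w)
# Packing happens once at load time; every inference reuses the result.
assert packed.shape == (2, 64, 8)
```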
   
   On architectures without VNNI, intgemm uses saturating 16-bit accumulation.  This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0.
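   A scalar NumPy sketch of the clipping behavior (unvectorized, purely to show the idea, not intgemm's SIMD kernel):

```python
import numpy as np

# Sketch of saturating 16-bit accumulation: clip the running sum to the
# int16 range instead of upcasting to 32 bits.  Products of small,
# near-zero parameters rarely reach the limits, so clipping loses little.
def saturating_dot_i16(a, b):
    acc = 0
    for x, y in zip(a.astype(np.int32), b.astype(np.int32)):
        acc = int(np.clip(acc + x * y, -32768, 32767))
    return acc

a = np.array([3, -2, 5, 1], dtype=np.int8)
b = np.array([4, 7, -1, 2], dtype=np.int8)
assert saturating_dot_i16(a, b) == -5          # no saturation needed
big = np.full(10, 127, dtype=np.int8)
assert saturating_dot_i16(big, big) == 32767   # clipped, not wrapped
```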
   
   Because x86 only offers an unsigned * signed multiply instruction and most people want signed * signed, there are two strategies one can take. 
   
   1. Add 128 to the data so it is unsigned.  But that biases the output.  DNNL calculates this bias on the fly by summing weights, then subtracts it out during the GEMM.  intgemm calculates this bias in advance, so it can be subtracted from the bias term with no overhead at runtime.  A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction. 
   2. Emulate signed * signed by normalizing the sign bit into the second argument.  This requires extra instructions in the hot loop but keeps the accumulator small, so it is less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided. 
   
   Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 
2. 
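   The arithmetic behind strategy 1 can be checked in a few lines of NumPy (a sketch of the identity only, not DNNL's or intgemm's code):

```python
import numpy as np

# Numeric check of the identity behind strategy 1:
#   (A + 128) @ B = A @ B + 128 * colsum(B)
# so subtracting the precomputed term 128 * colsum(B) restores the
# true signed product.
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(4, 64)).astype(np.int32)  # signed data
B = rng.integers(-128, 128, size=(64, 8)).astype(np.int32)  # signed weights

A_u = A + 128                   # shifted into [0, 255]: usable as unsigned
bias = 128 * B.sum(axis=0)      # computed once per weight matrix
C = A_u @ B - bias              # equals the true signed product
assert np.array_equal(C, A @ B)
```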
   
   Similar to DNNL, intgemm has runtime CPUID selection among backends for 
SSSE3, AVX2, AVX512BW, and AVX512VNNI. 
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [x] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to 
the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) 
created (except PRs with tiny changes)
   - [x] Changes are complete (i.e. I finished coding on this PR).
   - [ ] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding 
a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing 
distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a 
new build option with NCCL)
   - [ ] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
   - Check the API doc at 
https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
   - [x] To the best of my knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] submodule for intgemm
   - [x] `intgemm_prepare_data` and `intgemm_prepare_weight` operators to 
convert operands from fp32
   - [x] `intgemm_take_weight` for taking weights in intgemm's weight format, 
which is useful for vocabulary shortlists in Sockeye.  
   - [x] `intgemm_fully_connected` for matrix multiply
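   
   As a rough, hypothetical stand-in for the data-preparation step (the real `intgemm_prepare_data` operator is defined in this PR and its exact semantics may differ), fp32-to-int8 quantization with a multiplier looks like:

```python
import numpy as np

# Hypothetical stand-in for fp32 -> int8 preparation, shown only to
# convey the concept: scale by a multiplier, round to the nearest
# integer, and clip to the int8 range.
def quantize_int8(x, multiplier):
    return np.clip(np.rint(x * multiplier), -127, 127).astype(np.int8)

x = np.array([0.5, -0.25, 1.0, -1.0], dtype=np.float32)
q = quantize_int8(x, multiplier=127.0)
assert q.tolist() == [64, -32, 127, -127]
```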
   
   ## Comments ##
   Backward compatible.
   intgemm requires the inner dimension to be a multiple of 64 for efficiency and alignment reasons.  Currently the output dimension must be a multiple of 8, but there is in-progress work in intgemm to remove that restriction.  
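   Until then, a caller-side workaround (an assumption, not part of this PR) is to zero-pad the inner dimension, since zero entries contribute nothing to the dot products:

```python
import numpy as np

# Caller-side workaround sketch: zero-pad the inner dimension up to a
# multiple of 64.  Zero columns add nothing to the matrix product.
def pad_inner(x, multiple=64):
    pad = (-x.shape[1]) % multiple
    return np.pad(x, [(0, 0), (0, pad)])

a = np.ones((2, 100), dtype=np.float32)
assert pad_inner(a).shape == (2, 128)
```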

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
