[jira] [Created] (MXNET-1446) Quantization: intgemm matrix multiply wrappers

Kenneth Heafield (Jira) Mon, 10 Feb 2020 07:12:01 -0800

Kenneth Heafield created MXNET-1446:
---------------------------------------


             Summary: Quantization: intgemm matrix multiply wrappers
                 Key: MXNET-1446
                 URL: https://issues.apache.org/jira/browse/MXNET-1446
             Project: Apache MXNet
          Issue Type: Improvement
          Components: Apache MXNet Backend
            Reporter: Kenneth Heafield


intgemm is an 8-bit and 16-bit matrix multiplication library for x86 CPUs: 
[https://github.com/kpu/intgemm] .  This issue proposes adding MXNet operator 
wrappers for it. 

A performance comparison with DNNL aka MKL-DNN is at 
[https://github.com/kpu/intgemm/issues/59]

The library targets thin matrix sizes seen in neural machine translation 
inference and was part of the top submission to the 2018 Workshop on Neural 
Generation and Translation efficiency task:  
[https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf] .  The purpose of 
this issue is to add similar functionality to Sockeye: 
[https://github.com/awslabs/sockeye/pull/771] . 

Quantized Sockeye performance is 2.95x as fast.  One problem with the current 
MXQuantizeSymbol approach is that Sockeye does not have a static graph for 
everything.

intgemm uses a custom memory layout for the weight matrix to make more memory 
accesses consecutive, so there are operators to convert weights to that format. 
The idea is that weights are typically loaded once for inference, so the 
conversion cost is paid once rather than on every multiply. 
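
To illustrate the "convert once, multiply many times" idea, here is a minimal 
sketch. It is not intgemm's actual layout, which is more elaborate and depends 
on the backend; it only repacks a row-major matrix so that each output 
column's weights are contiguous. The names prepare_weights and 
multiply_prepared are hypothetical.

```python
def prepare_weights(weights_row_major, rows, cols):
    """Repack a rows x cols row-major matrix so each column is contiguous.

    Illustrative only: this is a plain transpose, not intgemm's real format.
    Done once when the model is loaded.
    """
    return [weights_row_major[r * cols + c]
            for c in range(cols)
            for r in range(rows)]


def multiply_prepared(data_row, prepared, rows, cols):
    """Multiply one data row (length rows) by the prepared weight matrix."""
    out = []
    for c in range(cols):
        # Each column is now one consecutive slice of memory.
        column = prepared[c * rows:(c + 1) * rows]
        out.append(sum(x * w for x, w in zip(data_row, column)))
    return out


# 3 x 2 matrix [[1, 2], [3, 4], [5, 6]] stored row-major.
prepared = prepare_weights([1, 2, 3, 4, 5, 6], rows=3, cols=2)
result = multiply_prepared([1, 1, 1], prepared, rows=3, cols=2)
```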

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. 
This avoids an expensive madd_epi16 instruction on every multiply by exploiting 
the fact that most neural network parameters are near 0, so the 16-bit sums 
rarely saturate.
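
A scalar sketch of what saturating 16-bit accumulation means, assuming 
unsigned 8-bit data and signed 8-bit weights. It ignores SIMD lanes and any 
periodic widening to 32 bits, and the function names are hypothetical. With 
small weights the saturating result matches the exact dot product; with 
large weights it clamps.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def dot_saturating16(unsigned_a, signed_b):
    """Dot product of uint8 values with int8 values in a 16-bit accumulator.

    Adjacent products are summed in pairs with saturation, then each pair
    is added to a saturating 16-bit running total.
    """
    assert len(unsigned_a) == len(signed_b) and len(unsigned_a) % 2 == 0
    acc = 0
    for i in range(0, len(unsigned_a), 2):
        pair = saturate16(unsigned_a[i] * signed_b[i]
                          + unsigned_a[i + 1] * signed_b[i + 1])
        acc = saturate16(acc + pair)
    return acc


def dot_exact(a, b):
    """Reference dot product with unbounded integers."""
    return sum(x * y for x, y in zip(a, b))


# Weights near 0: no saturation, so the 16-bit result is exact.
small = dot_saturating16([200, 180, 220, 255], [3, -2, 1, -4])
# Large weights: the accumulator clamps at INT16_MAX.
large = dot_saturating16([255] * 8, [127] * 8)
```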

Because x86 only offers an unsigned * signed instruction and most people want 
signed * signed, there are two strategies one can take. 
 # Add 128 to the data so that it becomes unsigned.  But that shifts the output 
by 128 times the sum of the weights.  DNNL calculates this offset on the fly by 
summing weights, then subtracts it out during GEMM.  intgemm calculates this 
offset in advance and subtracts it from the bias term, so there is no overhead 
at runtime.  A problem with this strategy is that it makes the accumulator 
bigger, requiring more upcasting with an expensive madd_epi16 instruction. 
 # Emulate signed * signed by normalizing the sign bit into the second 
argument. This requires extra instructions in the hot loop but keeps the 
accumulator small, so it's less necessary to accumulate into 32-bit integers 
and madd_epi16 can be avoided. 
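
Strategy 1 can be checked with a scalar sketch. It assumes signed 8-bit data 
and weights and a per-column bias term; the helper names are hypothetical and 
this is not the intgemm or DNNL API. The point is that (a + 128) * b summed 
over a column equals a * b plus 128 * sum(b), and the second term depends only 
on the weights, so it can be folded into the bias ahead of time.

```python
SHIFT = 128


def dot(a, b):
    """Plain integer dot product."""
    return sum(x * y for x, y in zip(a, b))


def fold_shift_into_bias(weight_column, bias):
    """Done once per column, in advance: bias minus SHIFT * sum(weights)."""
    return bias - SHIFT * sum(weight_column)


def shifted_dot(signed_data, weight_column, adjusted_bias):
    """Runtime path: unsigned data * signed weights plus the adjusted bias."""
    unsigned_data = [x + SHIFT for x in signed_data]
    assert all(0 <= u <= 255 for u in unsigned_data)
    return dot(unsigned_data, weight_column) + adjusted_bias


data = [-128, -1, 0, 127]
weights = [5, -3, 2, 7]
bias = 10

adjusted = fold_shift_into_bias(weights, bias)
via_shift = shifted_dot(data, weights, adjusted)
reference = dot(data, weights) + bias
```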

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2. 
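
Strategy 2 rests on the identity a * b = |a| * (sign(a) * b): the first 
operand becomes unsigned and the second stays signed. A scalar sketch follows, 
with a hypothetical function name. It assumes quantized values lie in 
-127..127 so that negating the second operand still fits in 8 bits; that 
range is an assumption of this sketch.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signed_dot_via_unsigned(a, b):
    """Signed * signed dot product using only unsigned * signed products."""
    total = 0
    for x, y in zip(a, b):
        assert -127 <= x <= 127 and -127 <= y <= 127  # sketch assumption
        magnitude = abs(x)            # unsigned operand, 0..127
        signed_other = sign(x) * y    # signed operand carrying the sign of x
        total += magnitude * signed_other
    return total


a = [-127, 5, 0, 100]
b = [3, -7, 9, -127]
emulated = signed_dot_via_unsigned(a, b)
direct = sum(x * y for x, y in zip(a, b))
```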

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, 
AVX2, AVX512BW, and AVX512VNNI. 
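
Runtime selection amounts to picking the most capable backend the CPU 
supports. A sketch under stated assumptions: the flag spellings follow Linux 
/proc/cpuinfo, the newest-first priority order and the error on unsupported 
CPUs are assumptions, and select_backend is a hypothetical name rather than 
intgemm's interface.

```python
# (cpuinfo flag, backend name), most capable first. The order is an assumption.
BACKEND_PRIORITY = [
    ("avx512_vnni", "AVX512VNNI"),
    ("avx512bw", "AVX512BW"),
    ("avx2", "AVX2"),
    ("ssse3", "SSSE3"),
]


def select_backend(cpu_flags):
    """Return the first backend in priority order whose flag is present."""
    for flag, backend in BACKEND_PRIORITY:
        if flag in cpu_flags:
            return backend
    raise RuntimeError("no supported backend: SSSE3 or newer is required")


chosen = select_backend({"sse2", "ssse3", "avx2"})
```

The selection would run once at startup, with the result cached, so the hot 
loop pays no dispatch cost.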

GitHub pull request coming. 



