Kenneth Heafield created MXNET-1446:
---------------------------------------
Summary: Quantization: intgemm matrix multiply wrappers
Key: MXNET-1446
URL: https://issues.apache.org/jira/browse/MXNET-1446
Project: Apache MXNet
Issue Type: Improvement
Components: Apache MXNet Backend
Reporter: Kenneth Heafield
intgemm is an 8-bit and 16-bit matrix multiplication library for x86 CPUs:
[https://github.com/kpu/intgemm] . This issue proposes adding wrappers.
A performance comparison with DNNL aka MKL-DNN is at
[https://github.com/kpu/intgemm/issues/59]
The library targets thin matrix sizes seen in neural machine translation
inference and was part of the top submission to the 2018 Workshop on Neural
Generation and Translation efficiency task:
[https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf] . The purpose of
this issue is to add similar functionality to Sockeye:
[https://github.com/awslabs/sockeye/pull/771] .
Quantized Sockeye performance is 2.95x as fast. One problem with the current
MXQuantizeSymbol approach is that Sockeye does not have a static graph for
everything.
intgemm uses a custom memory layout for the weight matrix to make more memory
accesses consecutive, so there are operators to convert weights to that format.
The idea is that weights are typically loaded once for inference.
On architectures without VNNI, intgemm uses saturating 16-bit accumulation.
This avoids an expensive madd_epi16 instruction on every multiply by exploiting
the fact that most neural network parameters are near 0, so 16-bit sums rarely
saturate.
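To make the accumulation scheme concrete, here is a minimal scalar sketch of a saturating 16-bit accumulator. It is a model of the arithmetic only, not intgemm's SIMD kernel; the function names SaturateToInt16 and DotSaturating16 are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>

// Clamp a 32-bit intermediate into the int16_t range, the way a saturating
// SIMD add would, instead of letting it wrap around.
inline int16_t SaturateToInt16(int32_t value) {
  const int32_t low = std::numeric_limits<int16_t>::min();
  const int32_t high = std::numeric_limits<int16_t>::max();
  return static_cast<int16_t>(std::min(high, std::max(low, value)));
}

// Hypothetical scalar model: dot product of unsigned 8-bit data with signed
// 8-bit weights, kept in a 16-bit accumulator that saturates on overflow.
inline int16_t DotSaturating16(const uint8_t *data, const int8_t *weights,
                               std::size_t size) {
  int16_t accumulator = 0;
  for (std::size_t i = 0; i < size; ++i) {
    // A single product is at most 255 * 127 in magnitude, so it fits in 16 bits.
    const int32_t product =
        static_cast<int32_t>(data[i]) * static_cast<int32_t>(weights[i]);
    accumulator = SaturateToInt16(static_cast<int32_t>(accumulator) + product);
  }
  return accumulator;
}
```

With weights near 0 the running sum stays well inside the 16-bit range and the result is exact; only large products of the same sign hit the clamp.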
Because x86 only offers an unsigned * signed instruction and most people want
signed * signed, there are two strategies one can take.
# Add 128 to the data so that it becomes unsigned. This biases the output. DNNL
calculates this bias on the fly by summing weights then subtracts it out during
GEMM. intgemm calculates this bias in advance, which can then be subtracted
from the bias term with no overhead at runtime. A problem with this strategy
is that it makes the accumulator bigger, requiring more upcasting with an
expensive madd_epi16 instruction.
# Emulate signed * signed by moving the first argument's sign into the second
argument. This requires extra instructions in the hot loop but keeps the
accumulator small, so it's less necessary to accumulate into 32-bit integers
and madd_epi16 can be avoided.
Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
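The two strategies can be checked against a plain signed dot product with a scalar sketch. This is an illustration of the arithmetic under stated assumptions, not code from intgemm or DNNL: all function names are hypothetical, accumulation is done in 32 bits for clarity, and values are assumed to be quantized to [-127, 127] so that negating an 8-bit value never overflows.

```cpp
#include <cstddef>
#include <cstdint>

// Reference: ordinary signed * signed dot product.
inline int32_t DotSigned(const int8_t *a, const int8_t *b, std::size_t size) {
  int32_t sum = 0;
  for (std::size_t i = 0; i < size; ++i)
    sum += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return sum;
}

// Strategy 1, ahead of time: sum((a + 128) * b) = sum(a * b) + 128 * sum(b),
// so the correction 128 * sum(b) depends only on the weights.
inline int32_t ShiftCorrection(const int8_t *b, std::size_t size) {
  int32_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) sum += b[i];
  return 128 * sum;
}

// Strategy 1, at runtime: shift data to unsigned, multiply unsigned * signed,
// then remove the precomputed correction (in practice folded into the bias).
inline int32_t DotShifted(const int8_t *a, const int8_t *b, std::size_t size,
                          int32_t correction) {
  int32_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) {
    const uint8_t shifted = static_cast<uint8_t>(static_cast<int32_t>(a[i]) + 128);
    sum += static_cast<int32_t>(shifted) * static_cast<int32_t>(b[i]);
  }
  return sum - correction;
}

// Strategy 2: a * b == |a| * (sign(a) * b). |a| is unsigned, and the sign of
// a is applied to b, so the multiply is again unsigned * signed.
// Assumes a[i] and b[i] lie in [-127, 127].
inline int32_t DotSignMoved(const int8_t *a, const int8_t *b, std::size_t size) {
  int32_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) {
    const int32_t value = a[i];
    const uint8_t magnitude = static_cast<uint8_t>(value < 0 ? -value : value);
    int8_t signed_b = 0;  // a == 0 contributes nothing
    if (value > 0) signed_b = b[i];
    if (value < 0) signed_b = static_cast<int8_t>(-b[i]);
    sum += static_cast<int32_t>(magnitude) * static_cast<int32_t>(signed_b);
  }
  return sum;
}
```

In strategy 1 the shifted data reaches 255, so individual products are larger than in strategy 2, where the magnitude is at most 127. That is the accumulator-size trade-off described in the list above.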
Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3,
AVX2, AVX512BW, and AVX512VNNI.
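Runtime selection of this kind can be sketched as a pure function from detected CPU features to a backend. The CpuFeatures struct, the Backend enum, and SelectBackend below are hypothetical names, not intgemm's API, and the "prefer the most capable instruction set" ordering is an assumption; filling in the flags from CPUID is left out.

```cpp
// Hypothetical feature flags; a real implementation would populate these
// once at startup by querying CPUID.
struct CpuFeatures {
  bool ssse3 = false;
  bool avx2 = false;
  bool avx512bw = false;
  bool avx512vnni = false;
};

// One entry per backend named in the issue, plus a fallback.
enum class Backend { kUnsupported, kSSSE3, kAVX2, kAVX512BW, kAVX512VNNI };

// Pick the most capable backend the CPU supports (assumed ordering).
inline Backend SelectBackend(const CpuFeatures &features) {
  if (features.avx512vnni) return Backend::kAVX512VNNI;
  if (features.avx512bw) return Backend::kAVX512BW;
  if (features.avx2) return Backend::kAVX2;
  if (features.ssse3) return Backend::kSSSE3;
  return Backend::kUnsupported;
}
```

Keeping the choice separate from feature detection means the selection logic can be tested on any machine, and the result can be cached in a function pointer so the check happens once rather than on every multiply.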
GitHub pull request coming.