Aharrypotter opened a new pull request, #19538:
URL: https://github.com/apache/tvm/pull/19538
## Summary
This PR adds quantized TFLite operator import support to the Relax frontend
by replacing all placeholder `_qnn.op.*` references with a QDQ (dequantize →
float op → quantize) decomposition built on the existing `relax.op.quantize` /
`relax.op.dequantize` infrastructure.
Previously, the frontend raised `NotImplementedError` at the tensor-parsing
stage whenever a TFLite model contained quantization metadata, so quantized
models could not be imported at all. After this PR, quantized tensor metadata
(`scale`, `zero_point`, and per-axis `QuantizedDimension`) is preserved
through the frontend.
### Design decision: QDQ vs fused QNN
Relax already has `R.quantize` / `R.dequantize` operators with full C++
registration, Python API, legalization, and tests. Rather than introducing
new fused `qnn.conv2d` / `qnn.dense` / `qnn.requantize` ops (which would
require 5 new Relax operators with C++ attrs, FFI, StructInfo, legalization,
and an RFC discussion), this PR reuses the existing QDQ ops and decomposes
quantized TFLite operators into:
```
dequantize → float Relax op → quantize
```
This has zero C++ changes, keeps the PR small, and defers fused-QNN ops to a
potential follow-up RFC if backend int8 kernel optimization is needed later.
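To make the decomposition concrete, here is a minimal NumPy sketch of the DQ → float op → Q pattern applied to a quantized ReLU6. It is not the frontend code: the helper names (`dequantize`, `quantize`, `qdq_relu6`) and the scale and zero-point values are assumptions chosen for illustration, and the arithmetic follows the usual affine scheme `real = scale * (q - zero_point)`.

```python
import numpy as np


def dequantize(q, scale, zero_point):
    # Affine dequantization: real = scale * (q - zero_point).
    return (q.astype(np.float32) - np.float32(zero_point)) * np.float32(scale)


def quantize(x, scale, zero_point, dtype=np.int8):
    # Affine quantization with rounding and saturation to the dtype range.
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)


def qdq_relu6(q_in, in_scale, in_zp, out_scale, out_zp):
    # Hypothetical stand-in for a quantized converter path:
    # dequantize -> float op (clip to [0, 6]) -> quantize.
    x = dequantize(q_in, in_scale, in_zp)
    y = np.clip(x, 0.0, 6.0)
    return quantize(y, out_scale, out_zp)


# Illustrative values only; input and output use different quantization params.
q_in = np.array([-128, -10, 0, 50, 127], dtype=np.int8)
q_out = qdq_relu6(q_in, in_scale=0.1, in_zp=0, out_scale=0.05, out_zp=-128)
print(q_out)
```

Because the float op sits between an explicit dequantize and quantize, differing input and output quantization parameters need no separate requantize step.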
### Converters updated
Eighteen `_qnn.op.*` call sites across 12 converter paths:
| Converter | `_qnn.op.*` removed | QDQ replacement |
|-----------|-------------------|-----------------|
| `quantize` / `dequantize` helpers | `_qnn.op.quantize`, `_qnn.op.dequantize` | `relax.op.quantize` / `relax.op.dequantize` + `axis` |
| `convert_relu` | `_qnn.op.requantize` | DQ → `R.nn.relu` → Q |
| `convert_relu6` | `_qnn.op.requantize` | DQ → `R.clip(0,6)` → Q |
| `convert_relu_n1_to_1` | `_qnn.op.requantize` | DQ → `R.clip(-1,1)` → Q |
| `convert_reshape` (uint8) | `_qnn.op.requantize` | DQ → `R.reshape` → Q |
| `_convert_reduce` | `_qnn.op.requantize` | DQ → reduce → Q |
| `convert_conv` | `_qnn.op.conv2d`, `_qnn.op.requantize` | DQ → `R.nn.conv2d` → Q |
| `convert_fully_connected` | `_qnn.op.dense`, `_qnn.op.requantize` | DQ → `R.matmul` → Q |
| `convert_concatenation` | `_qnn.op.concat` | DQ each → `R.concat` → Q |
| `convert_transpose_conv` | `_qnn.op.conv2d_transpose`, `_qnn.op.requantize` | DQ → `R.nn.conv2d_transpose` → Q |
| `convert_detection_postprocess` | `_qnn.op.dequantize` (×3) | `self.dequantize` |
All `_qnn.op.*` references are eliminated. The `# ruff: noqa: F821` comment
is removed.
Closes #19534.
## Changes
### 1. Preserve tensor quantization metadata
- `get_tensors()`: read `scale`, `zero_point`, and `QuantizedDimension()` from
  TFLite `QuantizationParameters` and store them in `TensorWrapper.qnn_params`
  as a dict with keys `scale`, `zero_point`, and `axis`.
- Remove the global `NotImplementedError` guard that blocked all quantized
  models at the tensor-parsing stage.
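As a rough illustration of the metadata shape described above, the sketch below packs per-tensor or per-axis parameters into a dict with the three keys. The helper name `build_qnn_params` and the choice to keep the values as NumPy arrays are assumptions; the actual frontend may wrap them differently (for example as Relax constants).

```python
import numpy as np


def build_qnn_params(scales, zero_points, quantized_dimension):
    """Hypothetical helper: pack TFLite quantization metadata into a dict."""
    scales = np.asarray(scales, dtype=np.float32)
    zero_points = np.asarray(zero_points, dtype=np.int32)
    if scales.size == 0:
        # No scale recorded: treat the tensor as unquantized.
        return None
    if scales.shape != zero_points.shape:
        raise ValueError("scale and zero_point must have the same length")
    return {
        "scale": scales,
        "zero_point": zero_points,
        "axis": int(quantized_dimension),
    }


per_tensor = build_qnn_params([0.5], [3], 0)                # one scale for the whole tensor
per_axis = build_qnn_params([0.1, 0.2, 0.3], [0, 0, 0], 0)  # one scale per output channel
float_tensor = build_qnn_params([], [], 0)                  # no quantization metadata
```

Keeping `axis` alongside `scale` and `zero_point` is what lets later converter paths pass a per-axis dimension to the quantize and dequantize helpers.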
### 2. Replace `quantize` / `dequantize` frontend helpers
- `OperatorConverter.quantize()`: `_qnn.op.quantize` → `relax.op.quantize`,
adding the per-axis `axis` parameter from `qnn_params`.
- `OperatorConverter.dequantize()`: `_qnn.op.dequantize` →
`relax.op.dequantize`, adding `axis`.
### 3. Simple elementwise / reshape / reduce requantize paths
- `convert_relu`, `convert_relu6`, `convert_relu_n1_to_1`: QNN fused
activation + requantize → DQ → float op → Q.
- `convert_reshape` (uint8 with differing input/output qparams): requantize
  on integer tensor → DQ → reshape → Q.
- `_convert_reduce`: int32 cast + requantize → DQ → reduce → Q.
### 4. Quantized Conv2D via QDQ decomposition
- `convert_conv`: `_qnn.op.conv2d` + `_qnn.op.requantize` →
DQ input + DQ weight (per-channel axis remap OC 0→3 for HWIO layout) +
`R.nn.conv2d` → Q.
- INT32/INT64 bias dequantized with `input_scale × weight_scale` before
  adding to the float conv output.
- Per-channel depthwise convolution raises `OpNotImplemented` because the
`[1,KH,KW,C*M] → [KH,KW,C,M]` reshape changes axis semantics.
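The depthwise restriction can be seen in a small NumPy sketch. Assuming a per-channel scale vector of length `C*M` on the last axis of the TFLite kernel and a zero point of 0 (both illustrative choices), the reshape splits the quantized axis into two axes, so the scales become a `(C, M)` grid rather than a 1-D vector along a single axis.

```python
import numpy as np

KH, KW, C, M = 3, 3, 2, 2

# Deterministic int8 kernel in the TFLite depthwise layout [1, KH, KW, C*M].
q = (np.arange(KH * KW * C * M) % 7 - 3).astype(np.int8).reshape(1, KH, KW, C * M)

# One scale per C*M channel (illustrative values, zero_point assumed 0).
scales = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)

# Reference: per-channel dequantization along axis 3 of the TFLite layout.
ref = q.astype(np.float32) * scales

# After the reshape the single quantized axis is split across axes 2 and 3,
# so matching the reference needs a 2-D (C, M) scale grid.
q_reshaped = q.reshape(KH, KW, C, M)
scale_grid = scales.reshape(C, M)
out = q_reshaped.astype(np.float32) * scale_grid

assert np.array_equal(out, ref.reshape(KH, KW, C, M))

# The scale varies along both axis 2 and axis 3 of the reshaped kernel.
varies_along_c = not np.all(scale_grid == scale_grid[:1, :])
varies_along_m = not np.all(scale_grid == scale_grid[:, :1])
```

A dequantize op that takes a 1-D scale and one `axis` cannot describe this grid directly, which is consistent with the converter raising `OpNotImplemented` for this case.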
### 5. Remaining quantized ops
- `convert_fully_connected`: `_qnn.op.dense` + `_qnn.op.requantize` →
DQ input + DQ weight (axis remap OC 0→1) + `R.matmul` → Q.
INT32/INT64 bias dequantized with `input_scale × weight_scale`.
- `convert_concatenation`: `_qnn.op.concat` → DQ each input → float concat →
quantize → activation.
- `convert_transpose_conv`: `_qnn.op.conv2d_transpose` + `_qnn.op.requantize` →
  DQ input + DQ weight (axis remap OHWI→IOHW, OC 0→1) +
  `R.nn.conv2d_transpose` → Q. INT32/INT64 bias dequantized.
  Also fixes a latent bug: `relax.op.nn.bias_add` → `relax.op.add`.
- `convert_detection_postprocess`: 3× inline `_qnn.op.dequantize` →
`self.dequantize`.
### 6. Cleanup
- Removed the `# ruff: noqa: F821` suppression comment (zero remaining `_qnn`
or `_expr` undefined-name references).
- Removed unused locals (`weight_shape` in FC, `output_tensor_type_str` in
`convert_quantize`).
## Axis remap table
Per-channel weight quantization requires remapping `QuantizedDimension()`
after TFLite-to-Relax layout transforms:
| Op | TFLite layout | Relax layout | Transform | Original axis → Remapped axis |
|----|--------------|-------------|-----------|------------------------------|
| Conv2D | [OC, KH, KW, IC] | [KH, KW, IC, OC] (HWIO) | transpose (1,2,3,0) | 0 → 3 |
| FullyConnected | [OC, IC] | [IC, OC] | permute_dims [1,0] | 0 → 1 |
| TransposeConv | [OC, KH, KW, IC] (OHWI) | [IC, OC, KH, KW] (IOHW) | permute_dims [3,0,1,2] | 0 → 1 |
| DepthwiseConv | [1, KH, KW, C×M] | [KH, KW, C, M] (HWOI) | reshape | unsupported (OpNotImplemented) |
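
The remapped axes in the table follow from one rule: after permuting with `perm`, the original axis `a` ends up at the position `j` where `perm[j] == a`. The NumPy sketch below checks that rule for the three supported rows and confirms that dequantizing before or after the Conv2D transpose gives identical values. The helper names `remap_axis` and `dequantize_per_axis` are hypothetical and not taken from the PR.

```python
import numpy as np


def dequantize_per_axis(q, scales, zero_points, axis):
    # Broadcast the 1-D scale / zero_point vectors along the given axis.
    shape = [1] * q.ndim
    shape[axis] = -1
    s = np.asarray(scales, dtype=np.float32).reshape(shape)
    z = np.asarray(zero_points, dtype=np.float32).reshape(shape)
    return (q.astype(np.float32) - z) * s


def remap_axis(axis, perm):
    # After np.transpose(x, perm), output axis j holds input axis perm[j].
    return list(perm).index(axis)


OC, KH, KW, IC = 4, 3, 3, 2
q = (np.arange(OC * KH * KW * IC) % 11 - 5).astype(np.int8).reshape(OC, KH, KW, IC)
scales = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)  # illustrative values
zero_points = np.zeros(OC, dtype=np.int32)

# Conv2D: [OC, KH, KW, IC] -> [KH, KW, IC, OC], so axis 0 becomes axis 3.
perm = (1, 2, 3, 0)
new_axis = remap_axis(0, perm)
dq_then_transpose = dequantize_per_axis(q, scales, zero_points, 0).transpose(perm)
transpose_then_dq = dequantize_per_axis(q.transpose(perm), scales, zero_points, new_axis)
assert np.array_equal(dq_then_transpose, transpose_then_dq)

fc_axis = remap_axis(0, (1, 0))                   # FullyConnected: [OC, IC] -> [IC, OC]
transpose_conv_axis = remap_axis(0, (3, 0, 1, 2)) # TransposeConv: OHWI -> IOHW
```

Passing the un-remapped axis after the transpose would either fail on a length mismatch or, when the dimensions happen to match, silently scale the wrong channels.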
## Known Limitations
- **Per-channel depthwise convolution**: raises `OpNotImplemented` because
the `[1,KH,KW,C×M] → [KH,KW,C,M]` reshape changes per-channel axis
semantics in a way that `R.dequantize` cannot express directly.
- **Per-channel bias scale**: INT32/INT64 bias dequantization uses a scalar
`input_scale × weight_scale`. For per-channel weights this should be a
per-channel scale; the current fallback uses a scalar approximation.
- **No end-to-end numerical validation**: the structural tests verify that
  the IR uses the expected QDQ pattern, but there is no numerical comparison
  against a TFLite reference output. This is deferred to follow-up work.
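
The per-channel bias limitation can be illustrated numerically. The sketch below compares a per-channel bias scale (`input_scale × weight_scale[c]`, zero point 0) with a single scalar scale. All values are made up, and using the first channel's weight scale as the scalar is purely an assumption for illustration; the description above does not say which scalar the fallback uses.

```python
import numpy as np

input_scale = np.float32(0.02)
weight_scales = np.array([0.5, 0.25, 0.125], dtype=np.float32)  # one per output channel
bias_q = np.array([100, 100, 100], dtype=np.int32)              # quantized INT32 bias

# Per-channel convention: bias_scale[c] = input_scale * weight_scale[c], zero_point = 0.
bias_per_channel = bias_q.astype(np.float32) * (input_scale * weight_scales)

# Scalar approximation (ASSUMPTION: first channel's weight scale stands in for all).
bias_scalar = bias_q.astype(np.float32) * (input_scale * weight_scales[0])

# Absolute error introduced per channel by the scalar approximation.
abs_error = np.abs(bias_per_channel - bias_scalar)
print(bias_per_channel, bias_scalar, abs_error)
```

In this toy setup the two agree only on the channel whose scale was picked as the scalar, so the size of the error depends on how far the per-channel weight scales spread.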
## Testing
Successful-conversion tests use manually-built minimal TFLite flatbuffers
with `tvm.ir.assert_structural_equal`. Unsupported-boundary tests use
`pytest.raises`.
### New tests (11)
| Test | Covers |
|------|--------|
| `test_tensor_quantization_parameters_are_parsed` | per-tensor + per-axis metadata parsing |
| `test_quantize_op_uses_relax_quantize` | TFLite QUANTIZE → `R.quantize` |
| `test_dequantize_op_uses_relax_dequantize` | TFLite DEQUANTIZE → `R.dequantize` |
| `test_quantized_conv2d_per_tensor_uses_qdq` | Conv2D DQ→conv2d→Q (no bias) |
| `test_quantized_conv2d_with_int32_bias_dequantizes_bias` | Conv2D INT32 bias DQ |
| `test_quantized_concat_uses_qdq` | Concat DQ each→concat→Q |
| `test_per_channel_depthwise_conv_unsupported` | Per-channel depthwise → OpNotImplemented |
| `test_uint8_reshape_requantize_uses_dq_reshape_q` | uint8 RESHAPE DQ→reshape→Q |
| `test_transpose_conv_with_int32_bias_dequantizes_bias` | TRANSPOSE_CONV INT32 bias DQ |
| `test_quantize_op_requantize_uses_dq_q` | TFLite QUANTIZE as requantize → DQ→Q |
| `test_quantized_fully_connected_with_int32_bias_dequantizes_bias` | FC INT32 bias DQ |
### Commands
```bash
python -m pytest tests/python/relax/test_frontend_tflite.py \
  -k "tensor_quantization_parameters_are_parsed or quantize_op or \
dequantize_op or quantized_conv2d or quantized_concat or \
per_channel_depthwise or uint8_reshape_requantize or \
transpose_conv_with_int32_bias or quantized_fully_connected" -v
```
## Result
- 18 `_qnn.op.*` call sites across 12 converters removed.
- 10 new structural-equal tests plus 1 unsupported-boundary test
(per-channel depthwise).
- Targeted quantized TFLite frontend tests pass locally.
- ruff F401/F821/F841 clean.
## Notes for Reviewers
- **Axis remap**: Per-channel weight dequantization requires remapping
  `QuantizedDimension()` after TFLite-to-Relax layout transforms (see the
  axis remap table above). This is the primary semantic complexity in this PR:
  getting the axis wrong silently produces incorrect per-channel results.
- **INT32 bias dequantization**: TFLite does not store explicit quantization
  parameters for INT32 bias tensors. The implicit convention is
  `bias_scale = input_scale × weight_scale` and `bias_zero_point = 0`. The
  frontend computes this at conversion time.
- **Test approach**: All new tests construct minimal TFLite flatbuffers
  inline (no TensorFlow dependency), following the pattern established by the
  existing DENSIFY and StableHLO tests.
## References
- Issue #19534: Support quantized TFLite import in Relax frontend
- TFLite quantization spec: https://www.tensorflow.org/lite/performance/quantization_spec