Aharrypotter opened a new pull request, #19538:
URL: https://github.com/apache/tvm/pull/19538

   ## Summary
   
   This PR adds quantized TFLite operator import support to the Relax frontend by
   replacing all placeholder `_qnn.op.*` references with a QDQ (dequantize →
   float op → quantize) decomposition using the existing `relax.op.quantize` /
   `relax.op.dequantize` infrastructure.
   
   Previously, the frontend raised `NotImplementedError` at the tensor-parsing
   stage whenever a TFLite model contained quantization metadata, making quantized
   models completely unreachable. After this PR, quantized tensor metadata
   (`scale`, `zero_point`, and per-axis `QuantizedDimension`) is preserved through
   the frontend.
   
   ### Design decision: QDQ vs fused QNN
   
   Relax already has `R.quantize` / `R.dequantize` operators with full C++
   registration, Python API, legalization, and tests.  Rather than introducing new
   fused `qnn.conv2d` / `qnn.dense` / `qnn.requantize` ops (which would require 5
   new Relax operators with C++ attrs, FFI, StructInfo, legalization, and an RFC
   discussion), this PR reuses the existing QDQ ops and decomposes quantized
   TFLite operators into:
   
   ```
   dequantize → float Relax op → quantize
   ```
   
   This has zero C++ changes, keeps the PR small, and defers fused-QNN ops to a
   potential follow-up RFC if backend int8 kernel optimization is needed later.
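
To make the decomposition concrete, here is a minimal pure-Python sketch of the arithmetic behind a QDQ ReLU6. It is not the frontend code: `dequantize`, `quantize`, and `qdq_relu6` are hypothetical helpers, and the rounding and clamping rules are assumptions chosen for illustration rather than a statement of what `relax.op.quantize` does.

```python
def dequantize(q, scale, zero_point):
    """Map integers to floats: real = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers: q = clamp(round(x / scale) + zero_point).

    Uses Python's built-in round(); the rounding mode is an assumption.
    """
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]


def qdq_relu6(q_in, in_scale, in_zp, out_scale, out_zp):
    """Dequantize, apply float clip(0, 6), then quantize with the output params."""
    real = dequantize(q_in, in_scale, in_zp)
    clipped = [min(6.0, max(0.0, v)) for v in real]
    return quantize(clipped, out_scale, out_zp)


# Power-of-two scales keep the float arithmetic exact in this example.
result = qdq_relu6(
    [-128, -28, 0, 127], in_scale=0.125, in_zp=-28, out_scale=0.0625, out_zp=-128
)
print(result)  # [-128, -128, -72, -32]
```

The input and output tensors may carry different quantization parameters, which is why the former requantize step is now expressed as a dequantize followed by a quantize.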
   
   ### Converters updated
   
   Eighteen `_qnn.op.*` call sites across 12 converter paths:
   
   | Converter | `_qnn.op.*` removed | QDQ replacement |
   |-----------|-------------------|-----------------|
   | `quantize` / `dequantize` helpers | `_qnn.op.quantize`, `_qnn.op.dequantize` | `relax.op.quantize` / `relax.op.dequantize` + `axis` |
   | `convert_relu` | `_qnn.op.requantize` | DQ → `R.nn.relu` → Q |
   | `convert_relu6` | `_qnn.op.requantize` | DQ → `R.clip(0,6)` → Q |
   | `convert_relu_n1_to_1` | `_qnn.op.requantize` | DQ → `R.clip(-1,1)` → Q |
   | `convert_reshape` (uint8) | `_qnn.op.requantize` | DQ → `R.reshape` → Q |
   | `_convert_reduce` | `_qnn.op.requantize` | DQ → reduce → Q |
   | `convert_conv` | `_qnn.op.conv2d`, `_qnn.op.requantize` | DQ → `R.nn.conv2d` → Q |
   | `convert_fully_connected` | `_qnn.op.dense`, `_qnn.op.requantize` | DQ → `R.matmul` → Q |
   | `convert_concatenation` | `_qnn.op.concat` | DQ each → `R.concat` → Q |
   | `convert_transpose_conv` | `_qnn.op.conv2d_transpose`, `_qnn.op.requantize` | DQ → `R.nn.conv2d_transpose` → Q |
   | `convert_detection_postprocess` | `_qnn.op.dequantize` (×3) | `self.dequantize` |
   
   All `_qnn.op.*` references are eliminated.  The `# ruff: noqa: F821` comment is
   removed.
   
   Closes #19534.
   
   ## Changes
   
   ### 1. Preserve tensor quantization metadata
   - `get_tensors()`: read `scale`, `zero_point`, and `QuantizedDimension()` from
     TFLite `QuantizationParameters` and store them in `TensorWrapper.qnn_params`
     as a dict with the keys `"scale"`, `"zero_point"`, and `"axis"`.
   - Remove the global `NotImplementedError` guard that blocked all quantized
     models at the tensor-parsing stage.
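
As a rough illustration of the metadata shape described above, the following sketch builds a `qnn_params`-style dict from already-extracted values. `make_qnn_params` is a hypothetical helper, not the code in `get_tensors()`; in particular, storing scalars for the per-tensor case and plain Python lists for the per-axis case is an assumption made for this example.

```python
def make_qnn_params(scales, zero_points, quantized_dimension):
    """Return None for float tensors, else a {"scale", "zero_point", "axis"} dict.

    Hypothetical helper: the argument names mirror the TFLite fields, but the
    storage format chosen here is an assumption.
    """
    if len(scales) == 0:
        return None  # no quantization metadata, so treat the tensor as float
    if len(scales) != len(zero_points):
        raise ValueError("scale and zero_point must have the same length")
    if len(scales) == 1:
        # Per-tensor quantization: a single scale / zero point pair.
        scale, zero_point = scales[0], zero_points[0]
    else:
        # Per-axis quantization: one pair per slice along `quantized_dimension`.
        scale, zero_point = list(scales), list(zero_points)
    return {"scale": scale, "zero_point": zero_point, "axis": quantized_dimension}


per_tensor = make_qnn_params([0.5], [3], 0)
per_axis = make_qnn_params([0.5, 0.25], [0, 0], 0)
```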
   
   ### 2. Replace `quantize` / `dequantize` frontend helpers
   - `OperatorConverter.quantize()`: `_qnn.op.quantize` → `relax.op.quantize`,
     adding the per-axis `axis` parameter from `qnn_params`.
   - `OperatorConverter.dequantize()`: `_qnn.op.dequantize` →
     `relax.op.dequantize`, adding `axis`.
   
   ### 3. Simple elementwise / reshape / reduce requantize paths
   - `convert_relu`, `convert_relu6`, `convert_relu_n1_to_1`: QNN fused
     activation + requantize → DQ → float op → Q.
   - `convert_reshape` (uint8 with differing input/output qparams): requantize on
     integer tensor → DQ → reshape → Q.
   - `_convert_reduce`: int32 cast + requantize → DQ → reduce → Q.
   
   ### 4. Quantized Conv2D via QDQ decomposition
   - `convert_conv`: `_qnn.op.conv2d` + `_qnn.op.requantize` →
     DQ input + DQ weight (per-channel axis remap OC 0→3 for HWIO layout) +
     `R.nn.conv2d` → Q.
   - INT32/INT64 bias dequantized with `input_scale × weight_scale` before adding
     to the float conv output.
   - Per-channel depthwise convolution raises `OpNotImplemented` because the
     `[1,KH,KW,C*M] → [KH,KW,C,M]` reshape changes axis semantics.
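
The bias handling above can be sketched in a few lines of plain Python. `dequantize_bias` is a hypothetical name used only for illustration; it covers the per-tensor case with a bias zero point of 0, and does not reproduce the frontend's Relax calls.

```python
def dequantize_bias(bias_q, input_scale, weight_scale):
    """Convert integer bias values to float with scale = input_scale * weight_scale."""
    bias_scale = input_scale * weight_scale  # the bias zero point is taken to be 0
    return [bias_scale * b for b in bias_q]


# Power-of-two scales keep the float arithmetic exact in this example.
bias_f = dequantize_bias([8, -16, 0], input_scale=0.5, weight_scale=0.25)
print(bias_f)  # [1.0, -2.0, 0.0]
```

The float bias can then be added directly to the output of the float convolution, before the final quantize.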
   
   ### 5. Remaining quantized ops
   - `convert_fully_connected`: `_qnn.op.dense` + `_qnn.op.requantize` →
     DQ input + DQ weight (axis remap OC 0→1) + `R.matmul` → Q.
     INT32/INT64 bias dequantized with `input_scale × weight_scale`.
   - `convert_concatenation`: `_qnn.op.concat` → DQ each input → float concat →
     quantize → activation.
   - `convert_transpose_conv`: `_qnn.op.conv2d_transpose` + `_qnn.op.requantize` →
     DQ input + DQ weight (axis remap OHWI→IOHW, OC 0→1) +
     `R.nn.conv2d_transpose` → Q.  INT32/INT64 bias dequantized.
     Also fixes a latent bug: `relax.op.nn.bias_add` → `relax.op.add`.
   - `convert_detection_postprocess`: 3× inline `_qnn.op.dequantize` →
     `self.dequantize`.
   
   ### 6. Cleanup
   - Removed the `# ruff: noqa: F821` suppression comment (zero remaining `_qnn`
     or `_expr` undefined-name references).
   - Removed unused locals (`weight_shape` in FC, `output_tensor_type_str` in
     `convert_quantize`).
   
   ## Axis remap table
   
   Per-channel weight quantization requires remapping `QuantizedDimension()` after
   TFLite-to-Relax layout transforms:
   
   | Op | TFLite layout | Relax layout | Transform | Original axis → Remapped axis |
   |----|--------------|-------------|-----------|------------------------------|
   | Conv2D | [OC, KH, KW, IC] | [KH, KW, IC, OC] (HWIO) | transpose (1,2,3,0) | 0 → 3 |
   | FullyConnected | [OC, IC] | [IC, OC] | permute_dims [1,0] | 0 → 1 |
   | TransposeConv | [OC, KH, KW, IC] (OHWI) | [IC, OC, KH, KW] (IOHW) | permute_dims [3,0,1,2] | 0 → 1 |
   | DepthwiseConv | [1, KH, KW, C×M] | [KH, KW, C, M] (HWOI) | reshape | unsupported (OpNotImplemented) |
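
For the pure-permutation rows of the table, the remapped axis is the position at which the original axis appears in the permutation. The sketch below checks the three supported rows; `remap_axis` is a hypothetical helper for illustration, assuming the usual transpose convention in which output dimension `i` takes input dimension `perm[i]`.

```python
def remap_axis(original_axis, perm):
    """Return where `original_axis` lands after a transpose with `perm`.

    Assumes output dimension i is input dimension perm[i].
    """
    return list(perm).index(original_axis)


conv2d_axis = remap_axis(0, (1, 2, 3, 0))          # [OC,KH,KW,IC] -> [KH,KW,IC,OC]
fully_connected_axis = remap_axis(0, (1, 0))       # [OC,IC] -> [IC,OC]
transpose_conv_axis = remap_axis(0, (3, 0, 1, 2))  # [OC,KH,KW,IC] -> [IC,OC,KH,KW]
print(conv2d_axis, fully_connected_axis, transpose_conv_axis)  # 3 1 1
```

The depthwise row involves a reshape rather than a permutation, so no such lookup applies to it.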
   
   ## Known Limitations
   
   - **Per-channel depthwise convolution**: raises `OpNotImplemented` because
     the `[1,KH,KW,C×M] → [KH,KW,C,M]` reshape changes per-channel axis
     semantics in a way that `R.dequantize` cannot express directly.
   - **Per-channel bias scale**: INT32/INT64 bias dequantization uses a scalar
     `input_scale × weight_scale`.  For per-channel weights this should be a
     per-channel scale; the current fallback uses a scalar approximation.
   - **No end-to-end numerical validation**: the structural tests verify that the
     IR uses the expected QDQ pattern, but there is no numerical comparison
     against a TFLite reference output.  This is deferred to follow-up work.
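
The depthwise limitation can be illustrated with index arithmetic alone. Assuming a row-major reshape of the last dimension `C×M` into `(C, M)`, a flat channel index `k` lands at `(k // M, k % M)`, so a per-channel scale vector of length `C×M` becomes a 2-D table of scales after the reshape rather than a vector along one axis. The names below are hypothetical and the scale values are made up for the example.

```python
def flat_to_channel_multiplier(k, depth_multiplier):
    """Map flat channel index k in [0, C*M) to (c, m) under a row-major reshape."""
    return divmod(k, depth_multiplier)


C, M = 2, 3
scales_flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # one scale per flat channel (made up)

# After the reshape, the scale depends on both the C index and the M index.
scales_2d = [[scales_flat[c * M + m] for m in range(M)] for c in range(C)]
print(flat_to_channel_multiplier(4, M))  # (1, 1)
print(scales_2d)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```

Because the scales vary along two dimensions of the reshaped weight, a single `axis` value is not enough to describe them.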
   
   ## Testing
   
   Successful-conversion tests use manually-built minimal TFLite flatbuffers
   with `tvm.ir.assert_structural_equal`.  Unsupported-boundary tests use
   `pytest.raises`.
   
   ### New tests (11)
   | Test | Covers |
   |------|--------|
   | `test_tensor_quantization_parameters_are_parsed` | per-tensor + per-axis metadata parsing |
   | `test_quantize_op_uses_relax_quantize` | TFLite QUANTIZE → `R.quantize` |
   | `test_dequantize_op_uses_relax_dequantize` | TFLite DEQUANTIZE → `R.dequantize` |
   | `test_quantized_conv2d_per_tensor_uses_qdq` | Conv2D DQ→conv2d→Q (no bias) |
   | `test_quantized_conv2d_with_int32_bias_dequantizes_bias` | Conv2D INT32 bias DQ |
   | `test_quantized_concat_uses_qdq` | Concat DQ each→concat→Q |
   | `test_per_channel_depthwise_conv_unsupported` | Per-channel depthwise → OpNotImplemented |
   | `test_uint8_reshape_requantize_uses_dq_reshape_q` | uint8 RESHAPE DQ→reshape→Q |
   | `test_transpose_conv_with_int32_bias_dequantizes_bias` | TRANSPOSE_CONV INT32 bias DQ |
   | `test_quantize_op_requantize_uses_dq_q` | TFLite QUANTIZE as requantize → DQ→Q |
   | `test_quantized_fully_connected_with_int32_bias_dequantizes_bias` | FC INT32 bias DQ |
   
   ### Commands
   
   ```bash
   python -m pytest tests/python/relax/test_frontend_tflite.py \
     -k "tensor_quantization_parameters_are_parsed or quantize_op or dequantize_op or \
         quantized_conv2d or quantized_concat or per_channel_depthwise or \
         uint8_reshape_requantize or transpose_conv_with_int32_bias or \
         quantized_fully_connected" -v
   ```
   
   ## Result
   
   - 18 `_qnn.op.*` call sites across 12 converters removed.
   - 10 new structural-equal tests plus 1 unsupported-boundary test (per-channel depthwise).
   - Targeted quantized TFLite frontend tests pass locally.
   - ruff F401/F821/F841 clean.
   
   ## Notes for Reviewers
   
   - **Axis remap**: Per-channel weight dequantization requires remapping
     `QuantizedDimension()` after TFLite-to-Relax layout transforms (see the axis
     remap table above).  This is the primary semantic complexity in this PR:
     getting the axis wrong silently produces incorrect per-channel results.
   - **INT32 bias dequantization**: TFLite does not store explicit quantization
     parameters for INT32 bias tensors.  The implicit convention is
     `bias_scale = input_scale × weight_scale` and `bias_zero_point = 0`.  The
     frontend computes this at conversion time.
   - **Test approach**: All new tests construct minimal TFLite flatbuffers inline
     (no TensorFlow dependency), following the pattern established by the existing
     DENSIFY and StableHLO tests.
   
   ## References
   
   - Issue #19534: Support quantized TFLite import in Relax frontend
   - TFLite quantization spec: https://www.tensorflow.org/lite/performance/quantization_spec
   

