On Sun, 22 Mar 2026 14:46:19 +0100
Vincent Jardin <[email protected]> wrote:
> This series adds per-queue Tx data-rate limiting to the mlx5 PMD using
> hardware packet pacing (PP), and a symmetric rte_eth_get_queue_rate_limit()
> ethdev API to read back the configured rate.
>
> Each Tx queue can be assigned an individual rate (in Mbps) at runtime via
> rte_eth_set_queue_rate_limit(). The mlx5 implementation allocates a PP
> context per queue from the HW rate table, programs the PP index into the
> SQ via modify_sq, and relies on the kernel to share identical rates
> across PP contexts to conserve table entries. A PMD-specific API exposes
> per-queue PP diagnostics and rate table capacity.
>
> Patch breakdown:
>
> 01/10 doc/nics/mlx5: fix stale packet pacing documentation
> 02/10 common/mlx5: query packet pacing rate table capabilities
> 03/10 common/mlx5: extend SQ modify to support rate limit update
> 04/10 net/mlx5: add per-queue packet pacing infrastructure
> 05/10 net/mlx5: support per-queue rate limiting
> 06/10 net/mlx5: add burst pacing devargs
> 07/10 net/mlx5: add testpmd command to query per-queue rate limit
> 08/10 ethdev: add getter for per-queue Tx rate limit
> 09/10 net/mlx5: implement per-queue Tx rate limit getter
> 10/10 net/mlx5: add rate table capacity query API
>
> Release notes for the new ethdev API and mlx5 per-queue rate
> limiting can be added to a release_26_07.rst once the file is
> created at the start of the 26.07 development cycle.
>
> Changes since v3:
>
> Addressed review feedback from Stephen and Slava (nvidia/Mellanox).
>
> Patch 02/10 (query caps):
> - Added Acked-by: Viacheslav Ovsiienko
>
> Patch 03/10 (SQ modify):
> - Define MLX5_MODIFY_SQ_IN_MODIFY_BITMASK_PACKET_PACING_RATE_LIMIT_INDEX
> enum in mlx5_prm.h, following the MLX5_MODIFY_RQ_IN_MODIFY_xxx pattern
> - Use read-modify-write for modify_bitmask (MLX5_GET64 | OR | MLX5_SET64)
> instead of direct overwrite, for forward compatibility
>
> Patch 04/10 (PP infrastructure):
> - Rename struct member and parameters from "rl" to "rate_limit"
> for consistency with codebase naming style
> - Replace MLX5_ASSERT(rate_mbps > 0) with runtime check returning
> -EINVAL in non-debug builds
> - Move mlx5_txq_free_pp_rate_limit() to after txq_obj_release() in
> mlx5_txq_release() — destroy the SQ before freeing the PP index
> it references
> - Clarify commit message: distinct PP handle per queue (for cleanup)
> but kernel shares the same pp_id for identical rate parameters
>
> Patch 05/10 (set rate):
> - Fix obj->sq vs obj->sq_obj.sq: use obj->sq_obj.sq from the start
> for non-hairpin queues (was introduced in patch 07 in v3, breaking
> git bisect)
> - Move all variable declarations to block top (sq_devx,
> new_rate_limit)
> - Add queue state check: reject set_queue_rate_limit if queue is not
> STARTED (SQ not in RDY state)
> - Update mlx5 feature matrix: Rate limitation = Y
> - Add Per-Queue Tx Rate Limiting documentation section in mlx5.rst
> covering DevX requirement, hardware support, rate table sharing,
> and testpmd usage
>
> Patch 06/10 (burst devargs):
> - Remove burst_upper_bound/typical_packet_size from Clock Queue
> path (mlx5_txpp_alloc_pp_index) — Clock Queue uses WQE rate
> pacing and does not need these parameters
> - Update commit message and documentation accordingly
>
> Patch 07/10 (testpmd + PMD query):
> - sq_obj.sq accessor change moved to patch 05 (see above)
> - sq_devx declaration moved to block top
>
> Patch 08/10 (ethdev getter) — split from v3 patch 08:
> - Split into ethdev API (this patch) and mlx5 driver (patch 09)
> - Add rte_eth_trace_get_queue_rate_limit() trace point matching
> the existing setter pattern
>
> Patch 09/10 — NEW (was part of v3 patch 08):
> - mlx5 driver implementation of get_queue_rate_limit callback,
> split out per Slava's request
>
> Patch 10/10 (rate table query):
> - Rename struct field "used" to "port_used" to clarify per-port
> scope
> - Strengthen Doxygen: rate table is a global shared HW resource
> (firmware, kernel, other DPDK instances may consume entries);
> port_used is a lower bound
> - Document PP sharing behavior with flags=0
> - Note that applications should aggregate across ports for
> device-wide visibility
>
> Changes since v2:
>
> Addressed review feedback from Stephen Hemminger:
>
> Patch 04: cleaned redundant cast parentheses on (struct mlx5dv_pp *)
> Patch 04: consolidated dv_alloc_pp call onto one line
> Patch 05+08: removed redundant queue_idx bounds checks from driver
> callbacks — ethdev layer is the single validation point
> Patch 07: added generic testpmd command: show port <id> queue <id> rate
> Patch 08+10: removed release notes from release_26_03.rst (targets 26.07)
> Patch 10: use MLX5_MEM_SYS | MLX5_MEM_ZERO for heap allocation
> Patch 10: consolidated packet_pacing_rate_table_size onto one line
>
> Changes since v1:
>
> Patch 01: Acked-by Viacheslav Ovsiienko
> Patch 04: rate bounds validation, uint64_t overflow fix, remove
> early PP free
> Patch 05: PP leak fix (temp struct pattern), rte_errno in error paths
> Patch 07: inverted rte_eth_tx_queue_is_valid() check
> Patch 10: stack array replaced with heap, per-port scope documented
>
> Testing:
>
> - Build: GCC, no warnings
> - Hardware: ConnectX-6 Dx
> - DevX path (default): set/get/disable rate limiting verified
> - Verbs path (dv_flow_en=0): returns -EINVAL cleanly (SQ DevX
> object not available), no crash
>
> Vincent Jardin (10):
> doc/nics/mlx5: fix stale packet pacing documentation
> common/mlx5: query packet pacing rate table capabilities
> common/mlx5: extend SQ modify to support rate limit update
> net/mlx5: add per-queue packet pacing infrastructure
> net/mlx5: support per-queue rate limiting
> net/mlx5: add burst pacing devargs
> net/mlx5: add testpmd command to query per-queue rate limit
> ethdev: add getter for per-queue Tx rate limit
> net/mlx5: implement per-queue Tx rate limit getter
> net/mlx5: add rate table capacity query API
>
> app/test-pmd/cmdline.c | 69 ++++++++++
> doc/guides/nics/features/mlx5.ini | 1 +
> doc/guides/nics/mlx5.rst | 180 ++++++++++++++++++++++-----
> drivers/common/mlx5/mlx5_devx_cmds.c | 23 ++++
> drivers/common/mlx5/mlx5_devx_cmds.h | 14 ++-
> drivers/common/mlx5/mlx5_prm.h | 7 ++
> drivers/net/mlx5/mlx5.c | 46 +++++++
> drivers/net/mlx5/mlx5.h | 13 ++
> drivers/net/mlx5/mlx5_testpmd.c | 93 ++++++++++++++
> drivers/net/mlx5/mlx5_tx.c | 104 +++++++++++++++-
> drivers/net/mlx5/mlx5_tx.h | 5 +
> drivers/net/mlx5/mlx5_txpp.c | 84 +++++++++++++
> drivers/net/mlx5/mlx5_txq.c | 149 ++++++++++++++++++++++
> drivers/net/mlx5/rte_pmd_mlx5.h | 74 +++++++++++
> lib/ethdev/ethdev_driver.h | 7 ++
> lib/ethdev/ethdev_trace.h | 9 ++
> lib/ethdev/ethdev_trace_points.c | 3 +
> lib/ethdev/rte_ethdev.c | 35 ++++++
> lib/ethdev/rte_ethdev.h | 24 ++++
> 19 files changed, 906 insertions(+), 33 deletions(-)
>
I didn't see anything wrong myself, but AI nit-picking turned up the issues below.
---
## Review: [PATCH v4 01-10/10] mlx5 per-queue rate limiting
### Patch 05/10 — `net/mlx5: support per-queue rate limiting`
**Error (Correctness): Data truncation in `rate_kbps` → PRM `rate_limit` field**
In `mlx5_txq_alloc_pp_rate_limit()`, `rate_kbps` is computed as
`(uint64_t)rate_mbps * 1000` and then stored with `MLX5_SET(..., rate_limit,
(uint32_t)rate_kbps)`. When `qos->packet_pacing_max_rate` is zero (HW doesn't
report a maximum), the range check is skipped, and any `rate_mbps` value above
~4,294,967 is silently truncated by the `uint32_t` cast, programming a
completely wrong rate into the hardware.
For example, `rate_mbps = 5000000` (5 Tbps — unlikely but allowed by the
`uint32_t` parameter type) yields `rate_kbps = 5,000,000,000` which truncates
to `705,032,704` kbps.
Suggested fix: add an explicit check that `rate_kbps` fits in `uint32_t` before
the cast, or cap it:
```c
if (rate_kbps > UINT32_MAX) {
	DRV_LOG(ERR, "Rate %u Mbps overflows PRM rate_limit field.", rate_mbps);
	rte_errno = ERANGE;
	return -ERANGE;
}
```
**Warning (Correctness): Missing NULL check on `txq_data` in
`rte_pmd_mlx5_txq_rate_limit_query()` (patch 07)**
After `rte_eth_tx_queue_is_valid()` confirms the index is in range, the code
does:
```c
txq_data = (*priv->txqs)[queue_id];
txq_ctrl = container_of(txq_data, struct mlx5_txq_ctrl, txq);
```
If the queue index is valid but the queue has been released (entry is NULL),
`container_of` on a NULL pointer produces an invalid pointer, and subsequent
dereferences are undefined behavior. The setter in patch 05 checks
`(*priv->txqs)[queue_idx] == NULL` but the query function does not.
Suggested fix: add a NULL check on `txq_data` before the `container_of`,
similar to what the setter does.
The same pattern appears in `rte_pmd_mlx5_pp_rate_table_query()` (patch 10),
but there the loop body does check `(*priv->txqs)[i] == NULL` and skips, so
that one is fine.
### Patch 09/10 — `net/mlx5: implement per-queue Tx rate limit getter`
**Note (Correctness): NULL check on queue data in
`mlx5_get_queue_rate_limit()` covers for the ethdev layer**
The getter checks `priv->txqs == NULL || (*priv->txqs)[queue_idx] == NULL`
which is good. However, the ethdev layer (`rte_eth_get_queue_rate_limit()` in
patch 08) only checks `queue_idx >= dev->data->nb_tx_queues`, not whether
`priv->txqs` is allocated. The driver-side check handles this, so this is
actually fine — noting for completeness.
### Patch 01/10 — `doc/nics/mlx5: fix stale packet pacing documentation`
**Warning: Missing `Cc: [email protected]`**
The patch has a `Fixes:` tag referencing a previous commit but is missing `Cc:
[email protected]` for backport consideration, even though it's a
documentation-only fix.
### Patches 02, 03, 04, 06, 08, 10
No correctness issues found. The code is well-structured with proper error
handling, resource cleanup on failure paths, and appropriate validation.
### General observations
The series is well organized with clean separation of concerns across patches.
Error paths in `mlx5_set_queue_rate_limit()` properly clean up the temporary
`new_rate_limit` PP allocation on modify-SQ failure, and `mlx5_txq_release()`
correctly frees the rate limit context on queue teardown. The `rte_errno =
-ret` pattern after `mlx5_devx_cmd_modify_sq()` is consistent with how other
mlx5 callers handle that function's negative errno return convention.