On Mon, 29 Jul 2024 11:50:37 +0530
Rakesh Kudurumalla <[email protected]> wrote:
> This patch addresses the issue by introducing a delay
> before acquiring the lock in the loop. This delay allows for better
> availability of the lock, ensuring that show_lcore_stats() can
> periodically update the statistics even when forwarding jobs are running.
>
> Fixes: 204896f8d66c ("examples/l2fwd-jobstats: add new example")
> Cc: [email protected]
>
> Signed-off-by: Rakesh Kudurumalla <[email protected]>
A more long-winded AI analysis of why this example is an architectural mess.
---
## 1. The Main Loop is Structurally Convoluted
The `l2fwd_main_loop()` is the core problem. It uses three nested loops with a
spinlock held across the entire active processing path:
```c
for (;;) { // outer: infinite
rte_spinlock_lock(&qconf->lock);
do { // middle: context loop
// idle job inner loop
do { // inner: busy-wait polling timers
...
} while (!need_manage);
rte_timer_manage();
} while (stats_read_pending == 0);
rte_spinlock_unlock(&qconf->lock);
rte_pause();
}
```
**Problems:**
The **spinlock is held during all packet processing**. It's only released
briefly when the stats reader sets `stats_read_pending`, which means
`show_lcore_stats()` on the main thread must wait for the datapath to notice
the flag *and* complete a full context cycle before it can acquire the lock.
This creates an unpredictable latency spike for stats collection and makes the
lock contention window essentially unbounded during normal forwarding.
The **inner idle loop is a raw busy-poll** comparing `timer.expire < now` for
every RX timer plus the flush timer on each iteration. With 16 ports per lcore
(MAX_RX_QUEUE_PER_LCORE), that's 17 comparisons per spin. There's no
`rte_pause()` inside the idle loop — only outside the outer spinlock unlock.
The recent patch you may have seen (replacing `rte_pause()` with
`rte_delay_us(10)` after unlock) treats the symptom, not the disease.
The **triple-loop structure makes the control flow genuinely hard to reason
about**. The middle `do...while` tests `stats_read_pending` which is set by a
remote lcore, but the inner loop checks it too (for `need_manage`). The idle
job abort-vs-finish logic (`repeats != 1`) is a subtle optimization that's easy
to misread — if the idle loop body executes exactly once, the job is aborted
rather than finished, meaning the timer was already expired on entry.
## 2. The Flush Job Has an Inverted Condition (Possible Bug)
In `l2fwd_flush_job()`:
```c
if (qconf->next_flush_time[portid] <= now)
continue;
```
This **skips flushing when the time has expired**, which is backwards. You'd
expect to flush when `now >= next_flush_time`, i.e. skip when `now <
next_flush_time`. This means TX buffers only get flushed when they *haven't*
expired yet — the opposite of the intended drain behavior. This could cause
packets to sit in TX buffers far longer than the intended 100μs drain interval,
contributing directly to the "slow" feeling at low traffic rates.
Additionally, `lcore_id` and `qconf` are redundantly fetched twice at the top
of the function.
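A minimal sketch of the expected drain condition, with hypothetical names (`flush_port_if_due`, `NB_PORTS`) standing in for the example's per-port state rather than actual DPDK code:

```c
#include <assert.h>
#include <stdint.h>

#define NB_PORTS 2  /* hypothetical port count for this sketch */

static uint64_t next_flush_time[NB_PORTS];

/* Corrected drain logic: flush a port only once its deadline has passed,
 * i.e. skip while the deadline is still in the future. */
static int
flush_port_if_due(uint16_t portid, uint64_t now, uint64_t drain_tsc)
{
	if (now < next_flush_time[portid])
		return 0;			/* not due yet: skip */
	/* ... rte_eth_tx_buffer_flush() would go here ... */
	next_flush_time[portid] = now + drain_tsc;
	return 1;			/* flushed and rearmed */
}
```

With this shape, a port whose deadline has expired is drained and its deadline rearmed, matching the 100μs drain interval the example intends.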
## 3. MAC Address Manipulation via Aliasing Violation
In `l2fwd_simple_forward()`:
```c
tmp = ð->dst_addr.addr_bytes[0];
*((uint64_t *)tmp) = 0x000000000002 + ((uint64_t)dst_port << 40);
```
This writes 8 bytes into a 6-byte MAC address through a `uint64_t*` cast, which
is both a strict aliasing violation and writes 2 bytes past the MAC address
into the ethertype field. The intent is to write `02:00:00:00:00:xx` as the
destination MAC, and it "works" on little-endian x86 because the excess bytes
happen to overwrite `src_addr` bytes that are immediately overwritten by the
`rte_ether_addr_copy()` below. But this is fragile, non-portable, and undefined
behavior. On a big-endian architecture this produces a completely wrong MAC.
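One strict-aliasing-safe, endian-independent alternative, sketched here with a stand-in 6-byte struct rather than DPDK's `struct rte_ether_addr` so it compiles without DPDK headers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for struct rte_ether_addr (6 bytes). */
struct ether_addr {
	uint8_t addr_bytes[6];
};

/* Build the intended 02:00:00:00:00:<port> destination MAC byte by byte
 * and memcpy it in: no write past the 6-byte field, no aliasing cast,
 * identical result on big- and little-endian targets. */
static void
set_dst_mac(struct ether_addr *dst, uint16_t dst_port)
{
	const uint8_t mac[6] = {
		0x02, 0x00, 0x00, 0x00, 0x00, (uint8_t)dst_port
	};

	memcpy(dst->addr_bytes, mac, sizeof(mac));
}
```

`memcpy` through `uint8_t` is well-defined for any object, and compilers routinely fold a fixed 6-byte copy into a couple of stores, so there is no performance argument for the cast.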
## 4. Global Mutable State Everywhere
The application uses an excessive amount of file-scope globals:
- `l2fwd_ports_eth_addr[RTE_MAX_ETHPORTS]`
- `l2fwd_enabled_port_mask`
- `l2fwd_dst_ports[RTE_MAX_ETHPORTS]`
- `lcore_queue_conf[RTE_MAX_LCORE]`
- `tx_buffer[RTE_MAX_ETHPORTS]`
- `port_statistics[RTE_MAX_ETHPORTS]`
- `l2fwd_pktmbuf_pool`
- `hz`, `drain_tsc`, `timer_period`
Most of these are `RTE_MAX_ETHPORTS`- or `RTE_MAX_LCORE`-sized arrays, statically
allocated regardless of actual usage. `port_statistics` is accessed from both
the datapath lcores (writes) and the stats display callback (reads) with **no
synchronization** — the stats numbers can tear on platforms without naturally
atomic 64-bit loads. The `memset(&port_statistics, 0, ...)` inside the per-port
init loop also clears all port stats on each port init, not just the current
port's.
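One way to make the counters tear-free without adding a lock is C11 relaxed atomics; a sketch (field and array names are illustrative, not the example's actual `struct l2fwd_port_statistics`):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Per-port counters as C11 atomics: the datapath increments with relaxed
 * ordering (a plain add on x86/arm64), and the display thread's relaxed
 * loads are guaranteed not to tear, even on 32-bit targets. */
struct port_stats {
	_Atomic uint64_t rx;
	_Atomic uint64_t tx;
	_Atomic uint64_t dropped;
};

static struct port_stats stats[4];	/* hypothetical 4 ports */

static inline void
count_rx(uint16_t port, uint64_t n)
{
	atomic_fetch_add_explicit(&stats[port].rx, n, memory_order_relaxed);
}

static inline uint64_t
read_rx(uint16_t port)
{
	return atomic_load_explicit(&stats[port].rx, memory_order_relaxed);
}
```

Relaxed ordering is sufficient here because the reader only wants recent, non-torn values, not cross-counter consistency; a consistent multi-field snapshot needs the seqlock/double-buffer approach mentioned later.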
## 5. The Timer-Driven Architecture Adds Complexity Without Proportional Benefit
The original l2fwd example uses a simple poll loop: check TSC, drain if needed,
burst RX, forward. This jobstats variant replaces that with:
- `rte_timer` for each RX port + flush timer per lcore
- `rte_jobstats` wrappers around each timer callback
- A custom `l2fwd_job_update_cb` that adjusts polling frequency via a
hysteresis algorithm with asymmetric step sizes (`UPDATE_STEP_UP = 1`,
`UPDATE_STEP_DOWN = 32`)
The adaptive polling idea (poll less often when idle, more often when busy) is
sound in theory, but the implementation has issues. The step-up granularity of
1 TSC tick is so small relative to the step-down of 32 that the period backs
off toward the idle rate extremely slowly: an idle stretch needs 32x as many
iterations to raise the period as a busy burst needs to drop it back, so the
controller is heavily biased toward busy polling. The target of
`MAX_PKT_BURST` (32 packets) as the "optimal" value is hard-coded with no way
to tune it.
The `rte_timer_manage()` call itself is not free — it walks the timer skip list
on each invocation, adding overhead that a direct TSC comparison in plain l2fwd
avoids.
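The control law can be modeled in isolation. This mirrors the asymmetric constants described above but is not the actual `l2fwd_job_update_cb`; the `min_p`/`max_p` clamps and the function name are assumptions of the sketch:

```c
#include <assert.h>
#include <stdint.h>

#define UPDATE_STEP_UP   1	/* idle: back off very slowly */
#define UPDATE_STEP_DOWN 32	/* busy: speed up in big jumps */

/* Model of the asymmetric period adjustment: when the job handled fewer
 * packets than the target it polls less often (period grows by 1 tick);
 * when it handled more, it polls more often (period shrinks by 32). */
static uint64_t
update_period(uint64_t period, uint64_t result, uint64_t target,
	      uint64_t min_p, uint64_t max_p)
{
	if (result < target && period + UPDATE_STEP_UP <= max_p)
		period += UPDATE_STEP_UP;
	else if (result > target && period > min_p + UPDATE_STEP_DOWN)
		period -= UPDATE_STEP_DOWN;
	return period;
}
```

Running this model shows the asymmetry directly: starting from a short period, thousands of idle callbacks are needed to reach the backed-off state that a single busy burst undoes 32 ticks at a time.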
## 6. The Double RX Burst is Questionable
```c
total_nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst, MAX_PKT_BURST);
// ... forward all ...
if (total_nb_rx == MAX_PKT_BURST) {
const uint16_t nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst,
MAX_PKT_BURST);
// ... forward these too ...
}
```
The comment says this "allows rte_jobstats logic to see if this function must
be called more frequently." The total is fed to `rte_jobstats_finish()` which
compares against the target. But this means `total_nb_rx` can be up to 64 while
the target is 32, and the jobstats update callback sees `result > target` and
decreases the period. The asymmetric step sizes make this interaction even more
unpredictable. Separately, the prefetch pattern is suboptimal — packets are
prefetched and forwarded one at a time (prefetch one, forward one), which
doesn't give the prefetch enough lead time to be useful.
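A staged-prefetch sketch in the style of l3fwd's RX loop; the names `forward_burst`/`process_packet` and the offset of 3 are hypothetical, and `process_packet` stands in for the real forwarding work:

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_OFFSET 3	/* how far ahead of processing to prefetch */

static int processed;		/* counter standing in for real forwarding */

static void
process_packet(void *pkt)
{
	(void)pkt;
	processed++;
}

/* Keep PREFETCH_OFFSET packets "in flight" in the cache: by the time
 * pkts[i] is touched, its prefetch was issued several packets ago. */
static void
forward_burst(void **pkts, int n)
{
	int i;

	/* Prime the pipeline. */
	for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
		__builtin_prefetch(pkts[i]);

	/* Steady state: prefetch pkts[i + OFFSET], process pkts[i]. */
	for (i = 0; i + PREFETCH_OFFSET < n; i++) {
		__builtin_prefetch(pkts[i + PREFETCH_OFFSET]);
		process_packet(pkts[i]);
	}

	/* Drain the tail (already prefetched). */
	for (; i < n; i++)
		process_packet(pkts[i]);
}
```

The point is lead time: a prefetch issued immediately before the access hides almost no latency, whereas issuing it a few packets ahead gives the memory system a full burst-iteration's worth of slack.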
## 7. Argument Parsing Edge Cases
`l2fwd_parse_portmask()` returns `unsigned long` cast through `int` (0 means
error), so a mask of 0 is indistinguishable from a parse error. On 64-bit
systems where `unsigned long` is 64 bits but `l2fwd_enabled_port_mask` is
`uint32_t`, masks above 0xFFFFFFFF silently truncate. The port mask check `(1
<< portid)` throughout the code is undefined behavior for `portid >= 32` since
`1` is an `int`.
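A hardened parser sketch; the `parse_portmask` signature returning an error code (instead of overloading 0) is an assumption of this sketch, not the example's current API:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a hex portmask. Returns 0 on success with *mask set, -1 on error.
 * A mask of 0 is now a legal, explicit value rather than an error code,
 * and values that would truncate a uint32_t are rejected. */
static int
parse_portmask(const char *arg, uint32_t *mask)
{
	char *end = NULL;
	unsigned long v;

	errno = 0;
	v = strtoul(arg, &end, 16);
	if (errno != 0 || end == arg || *end != '\0')
		return -1;		/* not a valid hex number */
	if (v > UINT32_MAX)
		return -1;		/* would silently truncate */
	*mask = (uint32_t)v;
	return 0;
}
```

The matching fix for the shift UB is to test the mask with an unsigned constant and a bounds check, e.g. `portid < 32 && (mask & (UINT32_C(1) << portid))`.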
## 8. No Graceful Shutdown
There's no signal handler. The application can only be killed with
SIGKILL/SIGTERM, leaving ports in an undefined state (promiscuous mode enabled,
potentially with traffic still queued in TX buffers). The `rte_eal_cleanup()`
at the end is unreachable during normal operation since `l2fwd_main_loop()`
never returns.
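A sketch of the shutdown pattern used by other DPDK examples such as l3fwd: a volatile flag flipped from a signal handler and checked by the worker loop so it can fall out and let `main()` stop the ports and reach `rte_eal_cleanup()`. The `force_quit` name follows that convention; wiring it into this example's timer loop is left open here.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

static volatile bool force_quit;

static void
signal_handler(int signum)
{
	if (signum == SIGINT || signum == SIGTERM)
		force_quit = true;	/* async-signal-safe: just set a flag */
}

/* In main(), before launching the lcores:
 *	signal(SIGINT, signal_handler);
 *	signal(SIGTERM, signal_handler);
 * and in each worker loop:
 *	while (!force_quit) { ... }
 */
```

Strictly, `volatile sig_atomic_t` is the portable type for a flag set from a handler; `volatile bool` is the idiom the DPDK examples actually use and is fine on the platforms DPDK targets.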
## Summary Recommendations
If you're looking at cleaning this up or using it as a basis for review
guidelines:
The flush condition inversion should be treated as a bug fix. The main loop
should be restructured to separate the stats synchronization concern from the
packet processing path — a lock-free stats snapshot approach (double-buffering
or seqlock) would eliminate the spinlock entirely. The MAC address write needs
to use `rte_ether_addr` properly. The adaptive polling logic needs tunable
parameters and a more balanced step function. And ideally, the example should
be reconciled with plain l2fwd to show clearly what jobstats *adds* rather than
being a near-complete fork with structural divergence.
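As one concrete shape for the seqlock option, a minimal single-writer sketch: the datapath lcore bumps a sequence counter around its updates, and the display thread retries until it observes a consistent, even sequence, so the forwarding path never blocks on the stats reader. This is illustrative only (newer DPDK releases ship `rte_seqlock.h` with a production version), and strictly the data fields should also be relaxed atomics to avoid a formal C11 data race; that detail is elided for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct stats_seq {
	_Atomic uint32_t seq;	/* odd while an update is in progress */
	uint64_t rx, tx;
};

/* Writer (datapath lcore only): make the sequence odd, update, make it
 * even again. Readers that overlap an update will see odd or mismatched
 * sequences and retry. */
static void
writer_update(struct stats_seq *s, uint64_t rx, uint64_t tx)
{
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
	s->rx += rx;
	s->tx += tx;
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
}

/* Reader (stats display): loop until a stable, even sequence brackets
 * the snapshot, yielding a consistent rx/tx pair with no lock held. */
static void
reader_snapshot(const struct stats_seq *s, uint64_t *rx, uint64_t *tx)
{
	uint32_t before, after;

	do {
		before = atomic_load_explicit(&s->seq, memory_order_acquire);
		*rx = s->rx;
		*tx = s->tx;
		after = atomic_load_explicit(&s->seq, memory_order_acquire);
	} while (before != after || (before & 1));
}
```

With this in place, the spinlock, the `stats_read_pending` flag, and the middle `do...while` of the main loop all become unnecessary, which is most of the structural cleanup argued for above.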