On Mon, 29 Jul 2024 11:50:37 +0530
Rakesh Kudurumalla <[email protected]> wrote:

> This patch addresses the issue by introducing a delay
> before acquiring the lock in the loop. This delay allows for better
> availability of the lock, ensuring that show_lcore_stats() can
> periodically update the statistics even when forwarding jobs are running.
> 
> Fixes: 204896f8d66c ("examples/l2fwd-jobstats: add new example")
> Cc: [email protected]
> 
> Signed-off-by: Rakesh Kudurumalla <[email protected]>

A longer-winded AI analysis of why this example is an architectural mess.


---

## 1. The Main Loop is Structurally Convoluted

The `l2fwd_main_loop()` is the core problem. It uses three nested loops with a 
spinlock held across the entire active processing path:

```
for (;;) {                          // outer: infinite
    rte_spinlock_lock(&qconf->lock);
    do {                            // middle: context loop
        // idle job inner loop
        do {                        // inner: busy-wait polling timers
            ...
        } while (!need_manage);
        rte_timer_manage();
    } while (stats_read_pending == 0);
    rte_spinlock_unlock(&qconf->lock);
    rte_pause();
}
```

**Problems:**

The **spinlock is held during all packet processing**. It's only released 
briefly when the stats reader sets `stats_read_pending`, which means 
`show_lcore_stats()` on the main thread must wait for the datapath to notice 
the flag *and* complete a full context cycle before it can acquire the lock. 
This creates an unpredictable latency spike for stats collection and makes the 
lock contention window essentially unbounded during normal forwarding.

The **inner idle loop is a raw busy-poll** comparing `timer.expire < now` for 
every RX timer plus the flush timer on each iteration. With 16 ports per lcore 
(MAX_RX_QUEUE_PER_LCORE), that's 17 comparisons per spin. There's no 
`rte_pause()` inside the idle loop — only outside the outer spinlock unlock. 
The recent patch you may have seen (replacing `rte_pause()` with 
`rte_delay_us(10)` after unlock) treats the symptom, not the disease.

The **triple-loop structure makes the control flow genuinely hard to reason 
about**. The middle `do...while` tests `stats_read_pending` which is set by a 
remote lcore, but the inner loop checks it too (for `need_manage`). The idle 
job abort-vs-finish logic (`repeats != 1`) is a subtle optimization that's easy 
to misread — if the idle loop body executes exactly once, the job is aborted 
rather than finished, meaning the timer was already expired on entry.
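For contrast, the same control flow can be flattened to a single loop that releases the lock every iteration, so a pending stats read never waits out a full context cycle. This is a minimal stand-in sketch — the lock, job, and timer calls are stubs, not the DPDK APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for rte_spinlock_lock/unlock and the job + timer work.
 * lock_release_count tracks how often a stats reader could get in. */
static bool stats_read_pending;
static int  lock_release_count;

static void fake_lock(void)           { }
static void fake_unlock(void)         { lock_release_count++; }
static void run_jobs_and_timers(void) { /* idle job, RX jobs, timer manage */ }

/* One level of looping, lock held only for the duration of one pass. */
static void
main_loop_iterations(int n)
{
    for (int i = 0; i < n; i++) {
        fake_lock();
        run_jobs_and_timers();
        fake_unlock();              /* released every pass, not per context cycle */
        if (stats_read_pending)
            stats_read_pending = false;
    }
}
```

The point is structural: the reader's worst-case wait becomes one iteration instead of "until the datapath notices the flag and finishes a context cycle."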

## 2. The Flush Job Has an Inverted Condition (Possible Bug)

In `l2fwd_flush_job()`:

```c
if (qconf->next_flush_time[portid] <= now)
    continue;
```

This **skips flushing when the time has expired**, which is backwards. You'd 
expect to flush when `now >= next_flush_time`, i.e. skip when `now < 
next_flush_time`. This means TX buffers only get flushed when they *haven't* 
expired yet — the opposite of the intended drain behavior. This could cause 
packets to sit in TX buffers far longer than the intended 100μs drain interval, 
contributing directly to the "slow" feeling at low traffic rates.

Additionally, `lcore_id` and `qconf` are fetched twice at the top of this 
function for no reason.
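The corrected guard is the negation of what the code does today. A sketch, factored into a hypothetical predicate so the intent is explicit (the helper name is mine, not from the example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the TX buffer for a port is due to be drained.
 * The example's code inverts this test, skipping the flush exactly
 * when the deadline has passed. */
static bool
flush_due(uint64_t next_flush_time, uint64_t now)
{
    return now >= next_flush_time;  /* flush once the deadline expires */
}
```

Inside `l2fwd_flush_job()` the loop body would then read `if (!flush_due(qconf->next_flush_time[portid], now)) continue;` — i.e. skip only ports whose deadline has *not* yet been reached.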

## 3. MAC Address Manipulation via Aliasing Violation

In `l2fwd_simple_forward()`:

```c
tmp = &eth->dst_addr.addr_bytes[0];
*((uint64_t *)tmp) = 0x000000000002 + ((uint64_t)dst_port << 40);
```

This writes 8 bytes into a 6-byte MAC address through a `uint64_t*` cast, which 
is both a strict aliasing violation and writes 2 bytes past the MAC address 
into the ethertype field. The intent is to write `02:00:00:00:00:xx` as the 
destination MAC, and it "works" on little-endian x86 because the excess bytes 
happen to overwrite `src_addr` bytes that are immediately overwritten by the 
`rte_ether_addr_copy()` below. But this is fragile, non-portable, and undefined 
behavior. On a big-endian architecture this produces a completely wrong MAC.

## 4. Global Mutable State Everywhere

The application uses an excessive amount of file-scope globals:

- `l2fwd_ports_eth_addr[RTE_MAX_ETHPORTS]`
- `l2fwd_enabled_port_mask`
- `l2fwd_dst_ports[RTE_MAX_ETHPORTS]`
- `lcore_queue_conf[RTE_MAX_LCORE]`
- `tx_buffer[RTE_MAX_ETHPORTS]`
- `port_statistics[RTE_MAX_ETHPORTS]`
- `l2fwd_pktmbuf_pool`
- `hz`, `drain_tsc`, `timer_period`

These are all `RTE_MAX_ETHPORTS`/`RTE_MAX_LCORE` sized arrays, statically 
allocated regardless of actual usage. `port_statistics` is accessed from both 
the datapath lcores (writes) and the stats display callback (reads) with **no 
synchronization** — the stats numbers can tear on platforms without naturally 
atomic 64-bit loads. The `memset(&port_statistics, 0, ...)` inside the per-port 
init loop also clears all port stats on each port init, not just the current 
port's.
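The tearing problem has a cheap fix: make the counters C11 atomics and use relaxed ordering, which is sufficient for statistics and rules out torn 64-bit loads on every platform. A sketch — the struct mirrors the example's per-port statistics but the field set is illustrative:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Per-port counters as atomics: datapath lcores add with relaxed
 * ordering (no fence cost on x86), the display callback loads the
 * same way. No lock, no torn reads. */
struct port_stats_atomic {
    atomic_uint_least64_t tx;
    atomic_uint_least64_t rx;
    atomic_uint_least64_t dropped;
};

static inline void
stats_add_rx(struct port_stats_atomic *s, uint64_t n)
{
    atomic_fetch_add_explicit(&s->rx, n, memory_order_relaxed);
}

static inline uint64_t
stats_read_rx(struct port_stats_atomic *s)
{
    return atomic_load_explicit(&s->rx, memory_order_relaxed);
}
```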

## 5. The Timer-Driven Architecture Adds Complexity Without Proportional Benefit

The original l2fwd example uses a simple poll loop: check TSC, drain if needed, 
burst RX, forward. This jobstats variant replaces that with:

- `rte_timer` for each RX port + flush timer per lcore
- `rte_jobstats` wrappers around each timer callback
- A custom `l2fwd_job_update_cb` that adjusts polling frequency via a 
hysteresis algorithm with asymmetric step sizes (`UPDATE_STEP_UP = 1`, 
`UPDATE_STEP_DOWN = 32`)

The adaptive polling idea (poll less when idle, poll more when busy) is sound 
in theory, but the implementation has issues. The step-up granularity of 1 TSC 
tick is so small relative to step-down of 32 that recovery from an idle period 
is extremely slow. If the system goes idle for a period and the polling 
interval ramps up, it takes 32x as many busy iterations to ramp back down. The 
target of `MAX_PKT_BURST` (32 packets) as the "optimal" value is hard-coded 
with no way to tune it.
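One way to make the adjustment both balanced and tunable is a simple proportional step: move the period a fixed fraction of the distance toward its bound. The 1/8 gain and all names here are assumptions for illustration, not the example's code:

```c
#include <assert.h>
#include <stdint.h>

/* Proportional alternative to fixed UPDATE_STEP_UP/UPDATE_STEP_DOWN:
 * symmetric behavior in both directions, and target is a parameter
 * rather than a hard-coded MAX_PKT_BURST. */
static uint64_t
adjust_period(uint64_t period, uint64_t min_p, uint64_t max_p,
              unsigned pkts, unsigned target)
{
    if (pkts < target)
        period += (max_p - period) / 8;   /* under target: poll less often */
    else if (pkts > target)
        period -= (period - min_p) / 8;   /* over target: poll more often */

    if (period < min_p)
        period = min_p;
    if (period > max_p)
        period = max_p;
    return period;
}
```

Recovery from an idle ramp-up then takes the same number of iterations as the ramp-up itself, rather than 32x as many.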

The `rte_timer_manage()` call itself is not free — it walks the timer skip list 
on each invocation, adding overhead that a direct TSC comparison in plain l2fwd 
avoids.

## 6. The Double RX Burst is Questionable

```c
total_nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst, MAX_PKT_BURST);
// ... forward all ...
if (total_nb_rx == MAX_PKT_BURST) {
    const uint16_t nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst, MAX_PKT_BURST);
    // ... forward these too ...
}
```

The comment says this "allows rte_jobstats logic to see if this function must 
be called more frequently." The total is fed to `rte_jobstats_finish()` which 
compares against the target. But this means `total_nb_rx` can be up to 64 while 
the target is 32, and the jobstats update callback sees `result > target` and 
decreases the period. The asymmetric step sizes make this interaction even more 
unpredictable. This also means the prefetch pattern is suboptimal — the first 
batch is prefetched and forwarded sequentially (prefetch one, forward one), 
which doesn't give the prefetch enough lead time to be useful.
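The conventional fix, similar to the pattern the l3fwd example uses, is to prefetch a fixed distance ahead of the packet currently being processed so the cache line has time to arrive. A stand-in sketch — the forwarding work is reduced to summing a field, and the offset of 4 is an assumption:

```c
#include <assert.h>
#include <stdint.h>

#define PREFETCH_OFFSET 4   /* assumed lead distance; tune per platform */

struct pkt { uint64_t len; };

/* Prefetch PREFETCH_OFFSET packets ahead of the one being processed,
 * instead of prefetch-one/forward-one with no lead time. */
static uint64_t
process_burst(struct pkt *pkts[], unsigned nb)
{
    unsigned i;
    uint64_t total = 0;

    /* prime the pipeline */
    for (i = 0; i < PREFETCH_OFFSET && i < nb; i++)
        __builtin_prefetch(pkts[i], 0, 3);

    for (i = 0; i < nb; i++) {
        if (i + PREFETCH_OFFSET < nb)
            __builtin_prefetch(pkts[i + PREFETCH_OFFSET], 0, 3);
        total += pkts[i]->len;   /* stand-in for l2fwd_simple_forward() */
    }
    return total;
}
```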

## 7. Argument Parsing Edge Cases

`l2fwd_parse_portmask()` returns the `strtoul()` result through an `int`, with 0 
treated as a parse error, so a legitimately empty mask is indistinguishable from 
bad input. On 64-bit systems where `unsigned long` is 64 bits but 
`l2fwd_enabled_port_mask` is `uint32_t`, masks above 0xFFFFFFFF silently 
truncate. The port mask check `(1 << portid)` throughout the code is undefined 
behavior for `portid >= 31`, since the literal `1` is a signed `int`.
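A sketch of an unambiguous parser: report the mask through an out-parameter so 0 is a valid value, check `errno` and the end pointer, and range-check before truncating (the function name and signature are mine):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 0 on success with the mask in *mask, -1 on any parse error.
 * A mask of 0 is valid and distinguishable from failure. */
static int
parse_portmask(const char *arg, uint32_t *mask)
{
    char *end = NULL;
    unsigned long v;

    errno = 0;
    v = strtoul(arg, &end, 16);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;              /* not a valid hex number */
    if (v > UINT32_MAX)
        return -1;              /* would silently truncate */
    *mask = (uint32_t)v;
    return 0;
}
```

The shift UB is fixed separately by using an unsigned one: `(UINT32_C(1) << portid)`, or DPDK's `RTE_BIT32(portid)`.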

## 8. No Graceful Shutdown

There's no signal handler, so stopping the application (e.g. with Ctrl-C) 
terminates it abruptly, leaving ports in an undefined state (promiscuous mode 
still enabled, potentially with traffic still queued in TX buffers). The 
`rte_eal_cleanup()` at the end is unreachable during normal operation since 
`l2fwd_main_loop()` never returns.
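The usual DPDK shutdown pattern is a handler that flips a flag the worker loops poll, turning `for (;;)` into `while (!force_quit)` and making the cleanup path (`rte_eth_dev_stop()`, `rte_eth_dev_close()`, `rte_eal_cleanup()`) reachable. A self-contained sketch of just the signal plumbing:

```c
#include <signal.h>

/* Set by the handler; polled by every worker loop. sig_atomic_t is the
 * only type guaranteed safe to write from a signal handler. */
static volatile sig_atomic_t force_quit;

static void
signal_handler(int signum)
{
    if (signum == SIGINT || signum == SIGTERM)
        force_quit = 1;
}

static void
install_signal_handlers(void)
{
    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);
}
```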

## Summary Recommendations

If you're looking at cleaning this up or using it as a basis for review 
guidelines:

The flush condition inversion should be treated as a bug fix. The main loop 
should be restructured to separate the stats synchronization concern from the 
packet processing path — a lock-free stats snapshot approach (double-buffering 
or seqlock) would eliminate the spinlock entirely. The MAC address write needs 
to use `rte_ether_addr` properly. The adaptive polling logic needs tunable 
parameters and a more balanced step function. And ideally, the example should 
be reconciled with plain l2fwd to show clearly what jobstats *adds* rather than 
being a near-complete fork with structural divergence.
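The lock-free snapshot idea can be sketched as a single-writer seqlock: the datapath lcore never blocks, and the display thread retries its read if a write overlapped it. This is a simplified illustration — names are mine, and a production version needs explicit fences around the non-atomic fields, which seq_cst atomics on the counter paper over here:

```c
#include <stdatomic.h>
#include <stdint.h>

/* seq is odd while a write is in progress, even otherwise. */
struct port_stats_seq {
    atomic_uint_fast32_t seq;
    uint64_t tx;
    uint64_t rx;
};

/* Writer (datapath lcore): bump seq to odd, update, bump back to even. */
static void
stats_update(struct port_stats_seq *s, uint64_t tx, uint64_t rx)
{
    atomic_fetch_add(&s->seq, 1);   /* odd: write in progress */
    s->tx += tx;
    s->rx += rx;
    atomic_fetch_add(&s->seq, 1);   /* even: write complete */
}

/* Reader (display thread): retry until a stable even snapshot is seen. */
static void
stats_snapshot(struct port_stats_seq *s, uint64_t *tx, uint64_t *rx)
{
    uint_fast32_t before, after;

    do {
        before = atomic_load(&s->seq);
        *tx = s->tx;
        *rx = s->rx;
        after = atomic_load(&s->seq);
    } while ((before & 1) || before != after);   /* torn read: retry */
}
```

With this in place the spinlock, the `stats_read_pending` flag, and the middle context loop all disappear from the datapath.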
