topology: introduce topology-aware lcore grouping

Varghese, Vipin Thu, 28 May 2026 00:17:59 -0700

AMD General

Sharing v6 with fixes and cache-id fetch shortly.


> -----Original Message-----
> From: Stephen Hemminger <[email protected]>
> Sent: Wednesday, April 15, 2026 1:52 AM
> To: Varghese, Vipin <[email protected]>
> Cc: [email protected]; Tummala, Sivaprasad <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v5 0/3] eal/topology: introduce topology-aware lcore 
> grouping
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> On Wed, 15 Apr 2026 01:08:18 +0530
> Vipin Varghese <[email protected]> wrote:
>
> > This series introduces a topology library that groups DPDK lcores
> > based on CPU cache hierarchy and NUMA topology. The goal is to provide
> > a stable and explicit API that allows applications to select lcores
> > with better locality and cache sharing characteristics.
> >
> > The series includes:
> >   - EAL support for topology discovery using hwloc and domain-based lcore
> >     grouping (L1/L2/L3/L4/NUMA)
> >   - Topology-aware test cases validating API behavior and edge conditions
> >   - Programmer’s guide describing the topology library and APIs
> >
> > The API is marked experimental and does not change existing lcore
> > behavior unless explicitly used by the application.
> >
> > Changes in v5:
> >   - Addressed review comments from v4
> >   - Fixed ARM cross-compilation issues
> >   - Cleaned up domain iteration and error handling
> >   - Updated tests to cover domain edge cases
> >   - Documentation refinements and API usage clarification
> >
> > Changes in v4:
> >   - Corrected domain selection semantics
> >   - Updated example usage
> >   - Fixed naming and typo issues
> >
> > Changes in v3:
> >   - Fixed macro naming (USE_NO_TOPOLOGY)
> >   - Minor cleanups based on early feedback
> >
> > Tested on:
> >   - AMD EPYC (Milan, Genoa, Siena, Turin, Turin-Dense, Sorano)
> >   - Intel Xeon (SPR-SP, GNR-SP)
> >   - ARM Ampere
> >   - NVIDIA Grace Superchip
> >
> > Dependencies:
> >   - hwloc-dev (tested with 2.10.0)
> >
> > Patch breakdown:
> >   1/3 eal/topology: add topology grouping for lcores
> >   2/3 app: add topology-aware test cases
> >   3/3 doc: add topology library documentation
> >
> > Future Work:
> >  - integrate into examples
> >   -- hellowrld: ready
> >   -- pkt-distributor: in-progress
> >   -- l2fwd: ready
> >   -- l3fwd: to start
> >   -- eventdevpipeline: PoC ready
> >  - integrate topology test
> >   -- crypto: yet to start
> >   -- compression: yet to start
> >   -- dma: PoC ready
> >  - add new features for
> >   -- PQoS: yet to start
> >   -- Data Injection: PoC with BRDCM Thor-2 ready
> >
> > Tested OS: Linux only, need help with BSD and Windows
> >
> > Tested with and without hwloc-dev library for
> >  - Ampere, aarch64, Neoverse-N1, NUMA-2, 256 CPU threads
> >  - Grace superchip, aarch64, Neoverse-V2, NUMA-2, 144 CPU threads
> >  - Intel GNR-SP, 6767P, NUMA-2, 256 Threads
> >  - AMD EPYC Siena, 8534P, NUMA-1, 128 Threads
> >  - AMD EPYC Sorano, 8635P, NUMA-1, 168 Threads
> >
> > Signed-off-by: Vipin Varghese <[email protected]> ``
> >
> > Vipin Varghese (3):
> >   eal/topology: add Topology grouping for lcores
> >   app: add topology aware test case
> >   doc: add new section topology
> >
> >  app/test/meson.build                   |   1 +
> >  app/test/test_ring_perf.c              | 416 +++++++++++++-
> >  app/test/test_stack_perf.c             | 409 ++++++++++++++
> >  app/test/test_topology.c               | 676 ++++++++++++++++++++++
> >  config/meson.build                     |  18 +
> >  doc/api/doxy-api-index.md              |   1 +
> >  doc/guides/prog_guide/index.rst        |   3 +-
> >  doc/guides/prog_guide/topology_lib.rst | 155 +++++
> >  lib/eal/common/eal_private.h           |  74 +++
> >  lib/eal/common/eal_topology.c          | 747 +++++++++++++++++++++++++
> >  lib/eal/common/meson.build             |   1 +
> >  lib/eal/freebsd/eal.c                  |  10 +-
> >  lib/eal/include/meson.build            |   1 +
> >  lib/eal/include/rte_topology.h         | 255 +++++++++
> >  lib/eal/linux/eal.c                    |   7 +
> >  lib/eal/meson.build                    |   4 +
> >  16 files changed, 2773 insertions(+), 5 deletions(-)  create mode
> > 100644 app/test/test_topology.c  create mode 100644
> > doc/guides/prog_guide/topology_lib.rst
> >  create mode 100644 lib/eal/common/eal_topology.c  create mode 100644
> > lib/eal/include/rte_topology.h
> >
>
> AI review reported lots of possible correctness problems:
>
>
> # DPDK Patch Review - bundle-1843-topology-v5.mbox
>
> ## Overview
>
> This patch series introduces topology awareness to DPDK, allowing 
> applications to
> select logical cores based on CPU cache and I/O topology. The series adds a 
> new
> `rte_topology` library with APIs to query and enumerate logical cores within 
> topology
> domains (L1/L2/L3/L4/NUMA).
>
> ---
>
> ## Correctness Bugs (HIGHEST PRIORITY - report at >=50% confidence)
>
> ### Patch 1/3: lib/eal/common/eal_topology.c
>
> #### **Error: Resource leak on `eal_topology_map_layer()` failure**
>
> In `rte_eal_topology_init()`, if any call to `eal_topology_map_layer()` fails 
> (returns <
> 0), the code calls `rte_eal_topology_release()` which frees previously 
> allocated
> layers. However, the `hwloc_topology_t` handle `topo_cnfg.topology` is **NOT**
> destroyed before returning. This leaks the hwloc topology handle.
>
> **Location:** `lib/eal/common/eal_topology.c:685-712`
>
> ```c
> for (int i = 0; i < 5; i++) {
>     *layers[i].count = hwloc_get_nbobjs_by_depth(topo_cnfg.topology,
> layers[i].depth);
>     if (eal_topology_map_layer(topo_cnfg.topology, layers[i].depth, 
> layers[i].count,
>         layers[i].ptr, layers[i].total_cores, layers[i].name) < 0) {
>         rte_eal_topology_release();  /* frees layer memory */
>         return -1;  /* BUG: topo_cnfg.topology NOT destroyed here */
>     }
> }
>
> hwloc_topology_destroy(topo_cnfg.topology);  /* only reached on success */
> topo_cnfg.topology = NULL; ```
>
> **Fix:** Destroy the topology before returning on error:
>
> ```c
> if (eal_topology_map_layer(...) < 0) {
>     hwloc_topology_destroy(topo_cnfg.topology);
>     topo_cnfg.topology = NULL;
>     rte_eal_topology_release();
>     return -1;
> }
> ```
>
> ---
>
> #### **Error: Potential use-after-free in `eal_topology_map_layer()` on 
> partial
> allocation failure**
>
> In `eal_topology_map_layer()`, the code allocates `dm->cores` for each 
> domain. If a
> later allocation in the same loop iteration fails (e.g., for domain `j+1`), 
> the function
> returns -1 immediately **without freeing `dm->cores` already allocated for 
> earlier
> domains**. The caller (`rte_eal_topology_init()`) then calls
> `rte_eal_topology_release()`, which expects `layer_ptr[j]` to be non-NULL but
> `layer_ptr[j]->cores` may be uninitialized or garbage if the allocation 
> failed before
> that point. This can cause a use-after-free or double-free when
> `rte_eal_topology_release()` calls `rte_free(d->map[i]->cores)`.
>
> **Location:** `lib/eal/common/eal_topology.c:537-557`
>
> ```c
> for (uint16_t j = 0; j < *layer_cnt; j++) {
>     hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, depth, j);
>     int cpu_count = hwloc_bitmap_weight(obj->cpuset);
>     if (cpu_count == -1)
>         continue;
>
>     struct core_domain_mapping *dm =
>         rte_zmalloc(NULL, sizeof(struct core_domain_mapping), 0);
>     if (!dm)
>         return -1;  /* BUG: leaks layer_ptr array allocated by caller */
>
>     (*layer_ptr)[j] = dm;
>     CPU_ZERO(&dm->core_set);
>     dm->core_count = 0;
>
>     dm->cores = rte_malloc(NULL, sizeof(uint16_t) * cpu_count, 0);
>     if (!dm->cores)
>         return -1;  /* BUG: leaks dm (just allocated) and previous entries */ 
> } ```
>
> **Fix:** On allocation failure, free all previously allocated entries before 
> returning:
>
> ```c
> if (!dm) {
>     /* Free all previously allocated entries */
>     for (uint16_t k = 0; k < j; k++) {
>         if ((*layer_ptr)[k]) {
>             rte_free((*layer_ptr)[k]->cores);
>             rte_free((*layer_ptr)[k]);
>         }
>     }
>     rte_free(*layer_ptr);
>     return -1;
> }
>
> /* Same cleanup for dm->cores allocation failure */ if (!dm->cores) {
>     rte_free(dm);
>     for (uint16_t k = 0; k < j; k++) {
>         if ((*layer_ptr)[k]) {
>             rte_free((*layer_ptr)[k]->cores);
>             rte_free((*layer_ptr)[k]);
>         }
>     }
>     rte_free(*layer_ptr);
>     return -1;
> }
> ```
>
> ---
>
> #### **Error: NULL pointer dereference in `eal_topology_map_layer()` when
> `hwloc_get_obj_by_depth()` returns NULL**
>
> In the second loop of `eal_topology_map_layer()` (lines 560-620), the code 
> calls
> `hwloc_get_obj_by_depth(topology, depth, j)` and checks `if (!obj ||
> hwloc_bitmap_iszero(obj->cpuset))` to skip NULL objects. However, in the 
> **inner
> loop** (lines 579-618), the code calls `hwloc_get_obj_by_depth(topology, 
> depth, k)`
> and assigns it to `obj_core` but does **NOT** check if `obj_core` is NULL 
> before
> calling `hwloc_bitmap_weight(obj_core->cpuset)` on line 581. If
> `hwloc_get_obj_by_depth()` returns NULL for domain `k`, this will dereference 
> a
> NULL pointer.
>
> **Location:** `lib/eal/common/eal_topology.c:579-582`
>
> ```c
> for (uint16_t k = 0; k < *layer_cnt; k++) {
>     hwloc_obj_t obj_core = hwloc_get_obj_by_depth(topology, depth, k);
>     int cpu_count_core = hwloc_bitmap_weight(obj_core->cpuset);  /* NULL 
> deref if
> obj_core == NULL */
>     if (cpu_count_core == -1)
>         continue;
> ```
>
> **Fix:** Check `obj_core` before dereferencing:
>
> ```c
> hwloc_obj_t obj_core = hwloc_get_obj_by_depth(topology, depth, k); if 
> (!obj_core)
>     continue;
> int cpu_count_core = hwloc_bitmap_weight(obj_core->cpuset);
> if (cpu_count_core == -1)
>     continue;
> ```
>
> ---
>
> #### **Error: Incorrect second argument to
> `rte_topo_get_nth_lcore_from_domain()` in `get_same_l1_domains()`
> (test_ring_perf.c and test_stack_perf.c)**
>
> In both `app/test/test_ring_perf.c:290` and `app/test/test_stack_perf.c:258`, 
> the
> function `get_same_l1_domains()` calls:
>
> ```c
> id2 = rte_topo_get_nth_lcore_from_domain(domain, 0, 0,
> RTE_TOPO_DOMAIN_L1); ```
>
> The second argument (`lcore_pos`) is `0`, which is the same as for `id1`. 
> This will
> assign **the same lcore** to both `id1` and `id2`, causing the subsequent 
> check `if
> (id1 == id2) return 3;` to always trigger. This is a logic error: the intent 
> is clearly to
> get two **different** lcores from the same domain.
>
> **Location:** `app/test/test_ring_perf.c:287-290` and
> `app/test/test_stack_perf.c:255-258`
>
> **Fix:** Use position `1` for the second lcore:
>
> ```c
> id1 = rte_topo_get_nth_lcore_from_domain(domain, 0, 0,
> RTE_TOPO_DOMAIN_L1);
> id2 = rte_topo_get_nth_lcore_from_domain(domain, 1, 0,
> RTE_TOPO_DOMAIN_L1); ```
>
> ---
>
> #### **Error: Iteration condition in `test_main_lcore_in_domain()` uses wrong
> domain type for lookup**
>
> In `app/test/test_topology.c:211`, the loop iterates over `domain_count` for
> `domain_types[d]`, but the call to `rte_topo_is_main_lcore_in_domain()` uses
> `RTE_TOPO_DOMAIN_NUMA` instead of `domain_types[d]`. This means the test
> only checks the NUMA domain regardless of which domain type `d` selects
> (L1/L2/L3/L4).
>
> **Location:** `app/test/test_topology.c:206-216`
>
> ```c
> for (unsigned int d = 0; d < RTE_DIM(domain_types); d++) {
>     bool main_lcore_found = false;
>     unsigned int domain_count = rte_topo_get_domain_count(domain_types[d]);
>     for (unsigned int dmn_idx = 0; dmn_idx < domain_count; dmn_idx++) {
>         main_lcore_found =
> rte_topo_is_main_lcore_in_domain(RTE_TOPO_DOMAIN_NUMA,  /* BUG: should
> be domain_types[d] */
>             dmn_idx);
>         if (main_lcore_found)
>             break;
>     }
> ```
>
> **Fix:**
>
> ```c
> main_lcore_found = rte_topo_is_main_lcore_in_domain(domain_types[d], dmn_idx);
> ```
>
> ---
>
> #### **Error: Infinite loop risk in `rte_topo_get_nth_lcore_from_domain()` 
> when `ptr-
> >core_count` is 0**
>
> In `lib/eal/common/eal_topology.c:296-318`, the function enters a `while (1)` 
> loop
> that increments `new_lcore_pos`. If `ptr->core_count` is 0 (which the code 
> checks
> earlier but does not return immediately), the loop will wrap `new_lcore_pos` 
> back to
> 0 indefinitely, never breaking. While the function returns `RTE_MAX_LCORE` if 
> `ptr-
> >core_count == 0` before the loop, the logic flow is unclear and the loop 
> >body does
> not have a clear termination condition if the core count is 0.
>
> **Location:** `lib/eal/common/eal_topology.c:283-318`
>
> **Fix:** Add a sanity check inside the loop to prevent infinite iteration:
>
> ```c
> unsigned int iterations = 0;
> while (1) {
>     if (iterations++ > ptr->core_count * 2)  /* safety limit */
>         return RTE_MAX_LCORE;
>     /* ... rest of loop ... */
> }
> ```
>
> However, the real issue is that the code already returns `RTE_MAX_LCORE` if
> `ptr->core_count == 0` on line 287, so this is more of a defensive-programming
> note. The function should be refactored for clarity.
>
> ---
>
> #### **Error: Missing NULL check after `get_domain_lcore_mapping()` in
> `rte_topo_get_next_lcore()`**
>
> In `rte_topo_get_next_lcore()`, the code calls `get_domain_lcore_mapping(flag,
> lcore_domain)` and checks if `ptr` is NULL on line 350. However, if `ptr` is 
> NULL,
> the function returns `RTE_MAX_LCORE`. This is correct, but the subsequent 
> logic
> on line 381 calls `rte_topo_is_main_lcore_in_domain(flag, lcore_domain)`, 
> which
> internally may call `get_domain_lcore_mapping()` again. If that call also 
> returns
> NULL (which it will if the domain is invalid), the function
> `rte_topo_is_main_lcore_in_domain()` will return `false`, which is safe. 
> However, the
> logic is fragile and should explicitly handle the NULL case to avoid relying 
> on
> transitive safety.
>
> **Location:** `lib/eal/common/eal_topology.c:381`
>
> **Recommendation:** The code is technically safe but could be clearer. No 
> change
> required, but consider restructuring for maintainability.
>
> ---
>
> ### Patch 2/3: app/test Topology Tests
>
> #### **Error: Macro `RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN`
> declares variable in macro expansion (shadowing risk)**
>
> In `lib/eal/include/rte_topology.h:243-248`, the macro
> `RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN` declares a local
> variable `main_lcore` inside the macro expansion:
>
> ```c
> #define RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN(lcore,
> domain_indx, flag)  \
>     lcore = rte_topo_get_nth_lcore_from_domain(domain, 0, 0, flag);  \
>     uint16_t main_lcore = rte_get_main_lcore();  \
>     for (lcore = (lcore != main_lcore) ? \
>         lcore : rte_topo_get_next_lcore(lcore, 1, 0, flag);  \
>         lcore < RTE_MAX_LCORE;  \
>         lcore = rte_topo_get_next_lcore(lcore, 1, 0, flag)) ```
>
> This can cause a compiler error or shadowing if the caller already has a 
> variable
> named `main_lcore` in scope. Additionally, the macro uses `domain` (line 244) 
> but
> the parameter is `domain_indx`, which is a typo and will cause a compilation 
> error.
>
> **Location:** `lib/eal/include/rte_topology.h:243-248`
>
> **Fix:** Wrap in a `do { } while (0)` and use a uniquely-named variable, or 
> document
> that the macro must not be used if `main_lcore` is already declared. Also fix 
> the
> typo:
>
> ```c
> #define RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN(lcore,
> domain_indx, flag)  \
>     lcore = rte_topo_get_nth_lcore_from_domain(domain_indx, 0, 0, flag);  \
>     uint16_t __topo_main_lcore = rte_get_main_lcore();  \
>     for (lcore = (lcore != __topo_main_lcore) ? \
>         lcore : rte_topo_get_next_lcore(lcore, 1, 0, flag);  \
>         lcore < RTE_MAX_LCORE;  \
>         lcore = rte_topo_get_next_lcore(lcore, 1, 0, flag)) ```
>
> ---
>
> ### Summary of Correctness Bugs
>
> 1. **hwloc topology leak** on `eal_topology_map_layer()` failure 2. **Resource
> leak** in `eal_topology_map_layer()` on partial allocation failure 3. **NULL 
> pointer
> dereference** in `eal_topology_map_layer()` inner loop 4. **Logic error** in
> `get_same_l1_domains()` (same lcore assigned to `id1` and `id2`) 5. **Wrong
> domain type** in `test_main_lcore_in_domain()` (uses
> `RTE_TOPO_DOMAIN_NUMA` instead of `domain_types[d]`) 6. **Macro typo** in
> `RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN` (uses `domain`
> instead of `domain_indx`) 7. **Potential infinite loop** in
> `rte_topo_get_nth_lcore_from_domain()` if `ptr->core_count == 0` (mitigated by
> early

RE: [PATCH v5 0/3] eal/topology: introduce topology-aware lcore grouping

Reply via email to