>Add rte_node_enqueue_deferred() which tracks runs of consecutive
>objects going to the same edge and flushes them efficiently in bulk.
>When all objects go to the same edge (the common case), the function
>uses rte_node_next_stream_move(), which swaps stream pointers instead
>of copying data.
>
>The deferred state (run_start and last_edge) is stored in the node
>fast-path cache line 1, keeping it close to other frequently accessed
>node data. The last_edge is preserved across node invocations,
>allowing speculation: if traffic continues to the same destination,
>no action is needed until the edge changes.
>
>The flush is performed automatically at the end of node processing
>by __rte_node_process().
>
>Signed-off-by: Robin Jarry <[email protected]>
>---
> lib/graph/graph_populate.c          |  1 +
> lib/graph/rte_graph_worker_common.h | 75 +++++++++++++++++++++++++++++
> 2 files changed, 76 insertions(+)
>

<snip>

>+/**
>+ * Enqueue objects to a next node in a cache-efficient deferred manner.
>+ *
>+ * This function tracks runs of objects going to the same edge. When the edge
>+ * changes, the previous run is flushed using bulk enqueue. At the end of node
>+ * processing, any remaining objects are flushed automatically. When all
>+ * objects go to the same edge (the common case), rte_node_next_stream_move()
>+ * is used, which swaps pointers instead of copying.
>+ *
>+ * The function does not require consecutive idx values. It can be called with
>+ * any stride (e.g., 0, 4, 8, ... to process batches of 4). All objects from
>+ * the previous idx up to the current one are considered part of the current
>+ * run until the edge changes.
>+ *
>+ * For homogeneous traffic, the destination node structure is touched once
>+ * per batch instead of once per object, reducing cache line bouncing.
>+ *
>+ * @param graph
>+ *   Graph pointer returned from rte_graph_lookup().
>+ * @param node
>+ *   Current node pointer.
>+ * @param next
>+ *   Next node edge index.
>+ * @param idx
>+ *   Index of the current object being processed in node->objs[].
>+ *
>+ * @see rte_node_next_stream_move().
>+ */
>+static inline void
>+rte_node_enqueue_deferred(struct rte_graph *graph, struct rte_node *node,
>+                         rte_edge_t next, uint16_t idx)
>+{
>+       if (next != node->deferred_last_edge) {
>+               /* edge changed, flush previous run if not empty */
>+               if (idx > node->deferred_run_start)
>+                       rte_node_enqueue(graph, node, node->deferred_last_edge,
>+                                        &node->objs[node->deferred_run_start],
>+                                        idx - node->deferred_run_start);
>+               node->deferred_run_start = idx;
>+               node->deferred_last_edge = next;
>+       }
>+}
>+

Can we add a deferredx4 variant too? It need not use SIMD but would reduce LoC
further.

Thanks,
Pavan.
