On Wed, 7 Feb 2024 19:53:52 -0800 Saeed Mahameed wrote:
> From: Tariq Toukan <[email protected]>
> 
> Add documentation for the feature and some details on some design decisions.
Thanks.

> diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst

SD, which is not the same SD that Jiri and William are talking about?
Please spell out the name. Please make this a general networking/
documentation file. If other vendors could take a look and make sure
this behavior makes sense for their plans / future devices that'd be
great.

> new file mode 100644
> index 000000000000..c8b4d8025a81
> --- /dev/null
> +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
> @@ -0,0 +1,134 @@
> +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +.. include:: <isonum.txt>
> +
> +==============================
> +Socket-Direct Netdev Combining
> +==============================
> +
> +:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> +
> +Contents
> +========
> +
> +- `Background`_
> +- `Overview`_
> +- `Channels distribution`_
> +- `Steering`_
> +- `Mutually exclusive features`_
> +
> +Background
> +==========
> +
> +NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to

Please make it sound a little less like a marketing leaflet.
Isn't multi-PF netdev a better name for the construct? We don't call
aRFS "queue direct", also socket has a BSD socket meaning.

> +connect directly to the network, each through its own dedicated PCIe interface. Through either a
> +connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
> +single card. This results in eliminating the network traffic traversing over the internal bus
> +between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
> +utilization and increasing network throughput.
> +
> +Overview
> +========
> +
> +This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
> +environment under one netdev instance. Passing traffic through different devices belonging to
> +different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
> +different numas to still feel a sense of proximity to the device and acheive improved performance.
> +
> +We acheive this by grouping PFs together, and creating the netdev only once all group members are
> +probed. Symmetrically, we destroy the netdev once any of the PFs is removed.

s/once/whenever/

> +The channels are distributed between all devices, a proper configuration would utilize the correct
> +close numa when working on a certain app/cpu.
> +
> +We pick one device to be a primary (leader), and it fills a special role. The other devices

"device" is probably best avoided, users may think device == card,
IIUC there's only one NIC ASIC here?

> +(secondaries) are disconnected from the network in the chip level (set to silent mode). All RX/TX

s/in/at/

> +traffic is steered through the primary to/from the secondaries.

I don't understand the "silent" part. I mean - you do pass traffic
thru them, what's the silence referring to?

> +Currently, we limit the support to PFs only, and up to two devices (sockets).
> +
> +Channels distribution
> +=====================
> +
> +Distribute the channels between the different SD-devices to acheive local numa node performance on

Something's missing in this sentence, subject "we"?

> +multiple numas.

NUMA nodes

> +Each channel works against one specific mdev, creating all datapath queues against it. We distribute

The mix of channel and queue does not compute in this sentence for me.
Also mdev -> PF?

> +channels to mdevs in a round-robin policy.
> +
> +Example for 2 PFs and 6 channels:
> ++-------+-------+
> +| ch ix | PF ix |

ix? id or idx or index.

> ++-------+-------+
> +| 0     | 0     |
> +| 1     | 1     |
> +| 2     | 0     |
> +| 3     | 1     |
> +| 4     | 0     |
> +| 5     | 1     |
> ++-------+-------+
> +
> +This round-robin distribution policy is preferred over another suggested intuitive distribution, in
> +which we first distribute one half of the channels to PF0 and then the second half to PF1.

Preferred.. by whom? Just say that's the most broadly useful and
therefore default config.

> +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
> +mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
> +As the channel stats are persistent to channels closure, changing the mapping every single time

to -> across
channels -> channel or channel's or channel closures

> +would turn the accumulative stats less representing of the channel's history.
> +
> +This is acheived by using the correct core device instance (mdev) in each channel, instead of them
> +all using the same instance under "priv->mdev".
> +
> +Steering
> +========
> +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
> +
> +In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
> +traffic to other PFs, via advanced HW cross-vhca steering capabilities.

s/advanced HW//

You should cover how RSS looks - single table which functions exactly
as it would for a 1-PF device? Two-tier setup?

> +In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
> +go out to the network through it.
> +
> +In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
> +PF on the same node as the cpu.
> +
> +XPS default config example:
> +
> +NUMA node(s): 2
> +NUMA node0 CPU(s): 0-11
> +NUMA node1 CPU(s): 12-23
> +
> +PF0 on node0, PF1 on node1.

You didn't cover how users are supposed to discover the topology.
netdev is linked to a single device in sysfs, which is how we get
netdev <> NUMA node mapping today. What's the expected way to get the
NUMA nodes here? And obviously this can't get merged until mlx5
exposes queue <> NAPI <> IRQ mapping via the netdev genl.

<snip>

> +Mutually exclusive features
> +===========================
> +
> +The nature of socket direct, where different channels work with different PFs, conflicts with
> +stateful features where the state is maintained in one of the PFs.
> +For exmaple, in the TLS device-offload feature, special context objects are created per connection
> +and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
> +we disable this combination for now.
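To make the round-robin mapping discussed in the quoted document concrete, here is a minimal
userspace C sketch. It is illustrative only, not the mlx5 implementation; channel_to_pf() is a
made-up helper name. The point it demonstrates is that the PF serving a channel depends only on
the channel index, never on the total channel count, which is why accumulated per-channel stats
keep referring to the same PF when the user reconfigures the number of channels.

/*
 * Illustrative sketch only -- not the mlx5 code. Mimics the fixed
 * round-robin channel->PF mapping described in the quoted document.
 */
#include <stdio.h>

#define NUM_PFS 2			/* up to two PFs (one per socket) */

static int channel_to_pf(int ch_idx)	/* hypothetical helper name */
{
	return ch_idx % NUM_PFS;
}

int main(void)
{
	int num_channels = 6;		/* e.g. configured via ethtool -L */
	int ch;

	for (ch = 0; ch < num_channels; ch++)
		printf("channel %d -> PF %d\n", ch, channel_to_pf(ch));
	return 0;
}

Running it with 6 channels reproduces the table quoted above (0->0, 1->1, 2->0, ...), and
changing num_channels leaves the mapping of the surviving indices untouched.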

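Similarly, a sketch of the default XPS intent described in the quoted Steering section, assuming
the example topology given there (CPUs 0-11 on node0, CPUs 12-23 on node1, PF0 on node0, PF1 on
node1, channels assigned round-robin). Again illustrative only; cpu_to_node(), pf_to_node() and
channel_to_pf() are made-up helpers, not driver code. Each CPU is made eligible only for TX queues
whose owning PF sits on the CPU's local NUMA node.

/*
 * Illustrative sketch only -- not driver code. Prints, for each CPU,
 * which TX queues a NUMA-local default XPS setup would allow it to use.
 */
#include <stdio.h>

#define NUM_PFS		2
#define NUM_CHANNELS	6
#define NUM_CPUS	24

static int cpu_to_node(int cpu)  { return cpu < 12 ? 0 : 1; }
static int pf_to_node(int pf)    { return pf; }	/* PF0 -> node0, PF1 -> node1 */
static int channel_to_pf(int ch) { return ch % NUM_PFS; }

int main(void)
{
	int cpu, ch;

	for (cpu = 0; cpu < NUM_CPUS; cpu++) {
		printf("cpu %2d (node %d): tx queues", cpu, cpu_to_node(cpu));
		for (ch = 0; ch < NUM_CHANNELS; ch++)
			if (pf_to_node(channel_to_pf(ch)) == cpu_to_node(cpu))
				printf(" %d", ch);
		printf("\n");
	}
	return 0;
}

On a live system the resulting per-queue CPU masks would be visible through the standard XPS
sysfs files, /sys/class/net/<dev>/queues/tx-<n>/xps_cpus.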