On Wed, 7 Feb 2024 19:53:52 -0800 Saeed Mahameed wrote:
> From: Tariq Toukan <[email protected]>
> 
> Add documentation for the feature and some details on some design decisions.
Thanks.

> diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst

SD, which is not the same SD that Jiri and William are talking about?
Please spell out the name. Please make this a general networking/
documentation file. If other vendors could take a look and make sure
this behavior makes sense for their plans / future devices that'd be
great.

> new file mode 100644
> index 000000000000..c8b4d8025a81
> --- /dev/null
> +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
> @@ -0,0 +1,134 @@
> +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +.. include:: <isonum.txt>
> +
> +==============================
> +Socket-Direct Netdev Combining
> +==============================
> +
> +:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> +
> +Contents
> +========
> +
> +- `Background`_
> +- `Overview`_
> +- `Channels distribution`_
> +- `Steering`_
> +- `Mutually exclusive features`_
> +
> +Background
> +==========
> +
> +NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to

Please make it sound a little less like a marketing leaflet.
Isn't multi-PF netdev a better name for the construct? We don't call
aRFS "queue direct", also socket has a BSD socket meaning.

> +connect directly to the network, each through its own dedicated PCIe interface. Through either a
> +connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
> +single card. This results in eliminating the network traffic traversing over the internal bus
> +between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
> +utilization and increasing network throughput.
> +
> +Overview
> +========
> +
> +This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
> +environment under one netdev instance. Passing traffic through different devices belonging to
> +different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
> +different numas to still feel a sense of proximity to the device and acheive improved performance.
> +
> +We acheive this by grouping PFs together, and creating the netdev only once all group members are
> +probed. Symmetrically, we destroy the netdev once any of the PFs is removed.

s/once/whenever/

> +The channels are distributed between all devices, a proper configuration would utilize the correct
> +close numa when working on a certain app/cpu.
> +
> +We pick one device to be a primary (leader), and it fills a special role. The other devices

"device" is probably best avoided, users may think device == card,
IIUC there's only one NIC ASIC here?

> +(secondaries) are disconnected from the network in the chip level (set to silent mode). All RX/TX

s/in/at/

> +traffic is steered through the primary to/from the secondaries.

I don't understand the "silent" part. I mean - you do pass traffic
thru them, what's the silence referring to?

> +Currently, we limit the support to PFs only, and up to two devices (sockets).
> +
> +Channels distribution
> +=====================
> +
> +Distribute the channels between the different SD-devices to acheive local numa node performance on

Something's missing in this sentence, subject "we"?

> +multiple numas.

NUMA nodes

> +Each channel works against one specific mdev, creating all datapath queues against it. We distribute

The mix of channel and queue does not compute in this sentence for me.
Also mdev -> PF?

> +channels to mdevs in a round-robin policy.
> +
> +Example for 2 PFs and 6 channels:
> ++-------+-------+
> +| ch ix | PF ix |

ix? id or idx or index.

> ++-------+-------+
> +| 0     | 0     |
> +| 1     | 1     |
> +| 2     | 0     |
> +| 3     | 1     |
> +| 4     | 0     |
> +| 5     | 1     |
> ++-------+-------+
> +
> +This round-robin distribution policy is preferred over another suggested intuitive distribution, in
> +which we first distribute one half of the channels to PF0 and then the second half to PF1.

Preferred.. by whom? Just say that's the most broadly useful and
therefore default config.

> +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
> +mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
> +As the channel stats are persistent to channels closure, changing the mapping every single time

to -> across
channels -> channel or channel's or channel closures

> +would turn the accumulative stats less representing of the channel's history.
> +
> +This is acheived by using the correct core device instance (mdev) in each channel, instead of them
> +all using the same instance under "priv->mdev".
> +
> +Steering
> +========
> +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
> +
> +In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
> +traffic to other PFs, via advanced HW cross-vhca steering capabilities.

s/advanced HW//

You should cover how RSS looks - single table which functions exactly
as it would for a 1-PF device? Two-tier setup?

> +In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
> +go out to the network through it.
> +
> +In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
> +PF on the same node as the cpu.
> +
> +XPS default config example:
> +
> +NUMA node(s): 2
> +NUMA node0 CPU(s): 0-11
> +NUMA node1 CPU(s): 12-23
> +
> +PF0 on node0, PF1 on node1.

You didn't cover how users are supposed to discover the topology.
netdev is linked to a single device in sysfs, which is how we get
netdev <> NUMA node mapping today. What's the expected way to get the
NUMA nodes here? And obviously this can't get merged until mlx5
exposes queue <> NAPI <> IRQ mapping via the netdev genl.

<snip>

> +Mutually exclusive features
> +===========================
> +
> +The nature of socket direct, where different channels work with different PFs, conflicts with
> +stateful features where the state is maintained in one of the PFs.
> +For exmaple, in the TLS device-offload feature, special context objects are created per connection
> +and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
> +we disable this combination for now.
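To make the round-robin mapping discussed in the quoted document concrete, here is a minimal
userspace C sketch. It is illustrative only, not the mlx5 implementation; channel_to_pf() is a
made-up helper name. The point it demonstrates is that the PF serving a channel depends only on
the channel index, never on the total channel count, which is why accumulated per-channel stats
keep referring to the same PF when the user reconfigures the number of channels.

/*
 * Illustrative sketch only -- not the mlx5 code. Mimics the fixed
 * round-robin channel->PF mapping described in the quoted document.
 */
#include <stdio.h>

#define NUM_PFS 2			/* up to two PFs (one per socket) */

static int channel_to_pf(int ch_idx)	/* hypothetical helper name */
{
	return ch_idx % NUM_PFS;
}

int main(void)
{
	int num_channels = 6;		/* e.g. configured via ethtool -L */
	int ch;

	for (ch = 0; ch < num_channels; ch++)
		printf("channel %d -> PF %d\n", ch, channel_to_pf(ch));
	return 0;
}

Running it with 6 channels reproduces the table quoted above (0->0, 1->1, 2->0, ...), and
changing num_channels leaves the mapping of the surviving indices untouched.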

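Similarly, a sketch of the default XPS intent described in the quoted Steering section, assuming
the example topology given there (CPUs 0-11 on node0, CPUs 12-23 on node1, PF0 on node0, PF1 on
node1, channels assigned round-robin). Again illustrative only; cpu_to_node(), pf_to_node() and
channel_to_pf() are made-up helpers, not driver code. Each CPU is made eligible only for TX queues
whose owning PF sits on the CPU's local NUMA node.

/*
 * Illustrative sketch only -- not driver code. Prints, for each CPU,
 * which TX queues a NUMA-local default XPS setup would allow it to use.
 */
#include <stdio.h>

#define NUM_PFS		2
#define NUM_CHANNELS	6
#define NUM_CPUS	24

static int cpu_to_node(int cpu)  { return cpu < 12 ? 0 : 1; }
static int pf_to_node(int pf)    { return pf; }	/* PF0 -> node0, PF1 -> node1 */
static int channel_to_pf(int ch) { return ch % NUM_PFS; }

int main(void)
{
	int cpu, ch;

	for (cpu = 0; cpu < NUM_CPUS; cpu++) {
		printf("cpu %2d (node %d): tx queues", cpu, cpu_to_node(cpu));
		for (ch = 0; ch < NUM_CHANNELS; ch++)
			if (pf_to_node(channel_to_pf(ch)) == cpu_to_node(cpu))
				printf(" %d", ch);
		printf("\n");
	}
	return 0;
}

On a live system the resulting per-queue CPU masks would be visible through the standard XPS
sysfs files, /sys/class/net/<dev>/queues/tx-<n>/xps_cpus.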