On 09/02/2026 13:17, David Marchand wrote:

By default, DPDK probes all available resources (like PCI devices) and
partially initialises them (i.e. takes them over).
OVS has relied on this behavior since the introduction of netdev-dpdk.
It is no longer really needed, since DPDK device hotplug has been
supported for some time now.

Besides, this initial probing may not be desirable:
- for PCI devices bound to vfio-pci, the first application taking over
   them "wins", meaning that OVS would prevent qemu from using some VF
   devices,
- for mlx5 devices, the driver maintains link status and liveness
   of all ports (taking some kernel lock) even when OVS only uses
   a subset of them,

Not only that:

- if we want to use DPDK devargs, they cause issues because the ports were already probed with their defaults during the initial probe.

- devices that may be dynamic (SFs, for example) are probed at init, but if the SF is destroyed and re-created, it is not re-probed, which leaves it non-functional.

Note: DPDK also probes other buses. This patch (as well as mine [1]) only handles PCI. In our downstream version we also add "-a auxiliary:mlx5_core.sf.0" to disable auxiliary bus probes.

I didn't include that in my patch as it seemed too "vendor specific", though on second thought there is no harm. WDYT?

[1] https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/


Change this behavior and disable the initial PCI probing by passing
a pci:0000:00:00.0 allow list.
This is not elegant but it has been used in a number of setups I know,
and there is no better DPDK API to achieve the same at the moment.
Preventing probing from other buses (that DPDK supports) is not
implemented for now, but could be added when DPDK offers an API.

This behavior change breaks setups that were using the
class=eth,mac=XX:XX:XX:XX:XX:XX syntax because OVS was relying on the
(fragile) assumption that all DPDK ports were probed at init once and
for all.
Add a warning for users of this syntax, update the documentation and
add an option to restore the original behavior via
'dpdk-probe-at-init=true'.
The user can still work around this new behavior "breakage" by setting, for example, "-b 0000:00:00.0" in dpdk-extra, which makes this new knob redundant.

This option also helps for unexpected cases like https://xkcd.com/1172/.
Funny indeed. I was wondering why we need a way to get the old behavior back at all...

Signed-off-by: David Marchand <[email protected]>
---
  Documentation/howto/dpdk.rst         |  6 +++++
  Documentation/intro/install/dpdk.rst |  8 ++++++
  NEWS                                 |  5 ++++
  lib/dpdk.c                           |  8 ++++++
  lib/netdev-dpdk.c                    |  2 +-
  tests/system-dpdk-macros.at          |  2 +-
  tests/system-dpdk.at                 | 40 ++++++++++++++--------------
  vswitchd/vswitch.xml                 | 15 +++++++++++
  8 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
index 73e630b07f..ce31dd1055 100644
--- a/Documentation/howto/dpdk.rst
+++ b/Documentation/howto/dpdk.rst
@@ -62,6 +62,12 @@ is suggested::

  .. important::

+    Using this syntax requires that DPDK probes the device owning those
+    multiple ports. This can be achieved by either setting an allowed list
+    of PCI devices in the ``dpdk-extra`` configuration, or by asking for
+    probing all devices available (PCI devices included) at initialisation
+    (setting ``dpdk-probe-at-init`` to true).
+
      Hotplugging physical interfaces is not supported using the above syntax.
      This is expected to change with the release of DPDK v18.05. For information
      on hotplugging physical interfaces, you should instead refer to
diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
index 6f4687bdea..a2ecdf36de 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -297,6 +297,14 @@ listed below. Defaults will be provided for all values not explicitly set.
    sockets. If not specified, this option will not be set by default. DPDK
    default will be used instead.

+``dpdk-probe-at-init``
+  Let DPDK EAL probe all available devices at initialisation.
+  This consumes more resources, as OVS may not use all probed devices, and
+  may have undesired side effects (like frequently taking the RTNL lock to
+  maintain link status and other states of mlx5 netdevs that OVS does not
+  care about). However, this option is needed when using the
+  ``class=eth,mac=XX:XX:XX:XX:XX:XX`` syntax for DPDK ports.
+
  ``dpdk-hugepage-dir``
    Directory where hugetlbfs is mounted

diff --git a/NEWS b/NEWS
index c3470b84ec..49ce028f9b 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,10 @@
  Post-v3.7.0
  --------------------
+   - DPDK:
+     * Probing of devices at DPDK init has been disabled to avoid wasting
+       resources on unused devices. This breaks DPDK netdev ports using
+       "class=eth,mac=" syntax (though it can be restored, see
+       Documentation/howto/dpdk.rst).


  v3.7.0 - xx xxx xxxx
diff --git a/lib/dpdk.c b/lib/dpdk.c
index d27b95cd9a..65e52e4a6e 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -430,6 +430,14 @@ dpdk_init__(const struct smap *ovs_other_config)
          svec_add_nocopy(&args, xasprintf("0@%d", cpu));
      }

+    if (!args_contains(&args, "-a") && !args_contains(&args, "-b")
+        && !smap_get_bool(ovs_other_config, "dpdk-probe-at-init", false)) {
+        /* Prevent DPDK from probing PCI devices, unless some -a/-b is set in
+         * config. */
+        svec_add(&args, "-a");
+        svec_add(&args, "pci:0000:00:00.0");
+    }
+
      svec_terminate(&args);

      optind = 1;
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index d3f8710e38..d657606e21 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2050,7 +2050,7 @@ netdev_dpdk_get_port_by_mac(const char *mac_str, char **extra_err)
          }
      }

-    *extra_err = xstrdup("unknown mac");
+    *extra_err = xstrdup("unknown mac (use dpdk-probe-at-init=true?)");
      return DPDK_ETH_PORT_ID_INVALID;
  }

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index f8ba766739..716d8a357d 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -139,7 +139,7 @@ m4_define([OVS_TRAFFIC_VSWITCHD_START],
     OVS_DPDK_PRE_CHECK()
     OVS_WAIT_WHILE([ip link show ovs-netdev])
     dnl For functional tests, no need for DPDK PCI probing.
-   OVS_DPDK_START([--no-pci], [--disable-system], [$3])
+   OVS_DPDK_START([], [--disable-system], [$3])
     dnl Add bridges, ports, etc.
     OVS_WAIT_WHILE([ip link show br0])
     AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 17d3d25955..bd1bead661 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -43,7 +43,7 @@ dnl Check if EAL init is successful
  AT_SETUP([OVS-DPDK - EAL init])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()
  AT_CHECK([grep "DPDK Enabled - initializing..." ovs-vswitchd.log], [], [stdout])
  AT_CHECK([grep "EAL" ovs-vswitchd.log], [], [stdout])
  AT_CHECK([grep "DPDK Enabled - initialized" ovs-vswitchd.log], [], [stdout])
@@ -59,7 +59,7 @@ AT_SETUP([OVS-DPDK - dpdk-lcore-mask conversion - single])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_START_OVSDB()
-OVS_DPDK_START_VSWITCHD([--no-pci])
+OVS_DPDK_START_VSWITCHD()
  CHECK_CPU_DISCOVERED()
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x1])
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true])
@@ -77,7 +77,7 @@ AT_SETUP([OVS-DPDK - dpdk-lcore-mask conversion - multi])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_START_OVSDB()
-OVS_DPDK_START_VSWITCHD([--no-pci])
+OVS_DPDK_START_VSWITCHD()
  CHECK_CPU_DISCOVERED(4)
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0xf])
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true])
@@ -95,7 +95,7 @@ AT_SETUP([OVS-DPDK - dpdk-lcore-mask conversion - non-contig])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_START_OVSDB()
-OVS_DPDK_START_VSWITCHD([--no-pci])
+OVS_DPDK_START_VSWITCHD()
  CHECK_CPU_DISCOVERED(8)
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0xca])
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true])
@@ -113,7 +113,7 @@ AT_SETUP([OVS-DPDK - dpdk-lcore-mask conversion - zeromask])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_START_OVSDB()
-OVS_DPDK_START_VSWITCHD([--no-pci])
+OVS_DPDK_START_VSWITCHD()
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x0])
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true])
  OVS_WAIT_UNTIL([grep "Ignoring database defined option 'dpdk-lcore-mask' due to invalid value '0x0'" ovs-vswitchd.log])
@@ -152,7 +152,7 @@ dnl Add vhost-user-client port
  AT_SETUP([OVS-DPDK - add vhost-user-client port])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -181,7 +181,7 @@ AT_SETUP([OVS-DPDK - ping vhost-user ports])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_CHECK_TESTPMD()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -237,7 +237,7 @@ AT_SETUP([OVS-DPDK - ping vhost-user-client ports])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_CHECK_TESTPMD()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -350,7 +350,7 @@ AT_SETUP([OVS-DPDK - Ingress policing create delete vport port])
  AT_KEYWORDS([dpdk])

  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and add ingress policer
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -387,7 +387,7 @@ AT_SETUP([OVS-DPDK - Ingress policing no policing rate])
  AT_KEYWORDS([dpdk])

  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and add ingress policer
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -421,7 +421,7 @@ AT_SETUP([OVS-DPDK - Ingress policing no policing burst])
  AT_KEYWORDS([dpdk])

  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and add ingress policer
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -487,7 +487,7 @@ AT_SETUP([OVS-DPDK - QoS create delete vport port])
  AT_KEYWORDS([dpdk])

  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and add egress policer
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -522,7 +522,7 @@ AT_SETUP([OVS-DPDK - QoS no cir])
  AT_KEYWORDS([dpdk])

  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and add egress policer
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -551,7 +551,7 @@ AT_SETUP([OVS-DPDK - QoS no cbs])
  AT_KEYWORDS([dpdk])

  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and add egress policer
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -657,7 +657,7 @@ AT_KEYWORDS([dpdk])

  OVS_DPDK_CHECK_TESTPMD()
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS with default MTU value
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -698,7 +698,7 @@ AT_KEYWORDS([dpdk])

  OVS_DPDK_CHECK_TESTPMD()
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and modify MTU value
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -816,7 +816,7 @@ AT_KEYWORDS([dpdk])

  OVS_DPDK_CHECK_TESTPMD()
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and set MTU value to max upper bound
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -858,7 +858,7 @@ AT_KEYWORDS([dpdk])

  OVS_DPDK_CHECK_TESTPMD()
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()

  dnl Add userspace bridge and attach it to OVS and set MTU value to min lower bound
  AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
@@ -897,7 +897,7 @@ AT_SETUP([OVS-DPDK - user configured mempool])
  AT_KEYWORDS([dpdk])
  OVS_DPDK_PRE_CHECK()
  OVS_DPDK_START_OVSDB()
-OVS_DPDK_START_VSWITCHD([--no-pci])
+OVS_DPDK_START_VSWITCHD()

  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:shared-mempool-config=8000,6000,1500])
  AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true])
@@ -946,7 +946,7 @@ dnl --------------------------------------------------------------------------
  AT_SETUP([OVS-DPDK - ovs-appctl dpif/offload/show])
  AT_KEYWORDS([dpdk dpif-offload])
  OVS_DPDK_PRE_CHECK()
-OVS_DPDK_START([--no-pci])
+OVS_DPDK_START()
  AT_CHECK([ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev])
  AT_CHECK([ovs-vsctl add-port br0 p1 \
    -- set Interface p1 type=dpdk options:dpdk-devargs=net_null0,no-rx=1],
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index b7a5afc0a5..2bd101bf2e 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -453,6 +453,21 @@
          </p>
        </column>

+      <column name="other_config" key="dpdk-probe-at-init"
+              type='{"type": "boolean"}'>
+        <p>
+          Specifies whether DPDK should probe all devices available at the
+          time DPDK is initialised. This is required when declaring DPDK ports
+          using the "class=eth,mac=XX:XX:XX:XX:XX:XX" syntax, but beware that
+          it implies more resource consumption and undesired side effects
+          with some devices (like mlx5).
+        </p>
+        <p>
+          If not specified, DPDK will probe no device at initialisation,
+          which should be fine in most cases.
+        </p>
+      </column>
+
        <column name="other_config" key="dpdk-extra"
                type='{"type": "string"}'>
          <p>
--
2.52.0
