Hi Anurag,
On 23/06/2022 11:18, Anurag Agarwal wrote:
From: Jan Scheurich <[email protected]>
Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
assignment of the rxqs of a port only if there are no local, non-isolated PMDs.
On typical servers with both physical ports on one NUMA node, this often
leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources.
The alternative, to manually pin the rxqs to PMDs on remote NUMA nodes, also
has drawbacks as it limits OVS' ability to auto load-balance the rxqs.
This patch introduces a new interface configuration option to allow ports to
be automatically polled by PMDs on any NUMA node:
ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
The group assignment algorithm now has the ability to select the lowest-loaded
PMD on any NUMA node, not just the local NUMA node on which the port's rxqs
reside. If this option is not present or is set to false, the legacy behaviour
applies.
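As a rough illustration of that selection (a self-contained sketch with
made-up names, not the code in this patch):

    /* Sketch only: pick the lowest-loaded PMD, optionally restricted
     * to the port's local NUMA node ("legacy behaviour"). */
    #include <stdbool.h>
    #include <stddef.h>

    struct pmd {
        int numa_id;
        unsigned load;      /* Estimated load in percent of one core. */
    };

    static struct pmd *
    select_pmd(struct pmd *pmds, size_t n, int local_numa, bool cross_numa)
    {
        struct pmd *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (!cross_numa && pmds[i].numa_id != local_numa) {
                continue;   /* cross-numa-polling=false: local NUMA only. */
            }
            if (!best || pmds[i].load < best->load) {
                best = &pmds[i];
            }
        }
        return best;        /* NULL if no eligible PMD. */
    }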
Co-authored-by: Anurag Agarwal <[email protected]>
Signed-off-by: Jan Scheurich <[email protected]>
Signed-off-by: Anurag Agarwal <[email protected]>
---
Changes in v5:
- Addressed comments from <[email protected]>
- First schedule rxqs that are not enabled for cross-numa scheduling
- Follow this with rxqs that are enabled for cross-numa scheduling

I don't think this is a correct fix for the issue reported. The root
problem reported is not really that rxqs with cross-numa=true are
assigned first, but that the pool of pmd resources is
changing/overlapping during the assignments, i.e. in the reported case
going from a full pool to a fixed per-numa pool.
With the change you have now, you could have something like:
3 rxqs, (1 cross-numa) and 2 pmds.
cross-numa=true rxq load: 80%
per-numa rxq loads: 45%, 40%
rxq assignment rounds
1.
pmd0 = 45
pmd1 = 0
2.
pmd0 = 45
pmd1 = 40
3.
pmd0 = 45 = 45%
pmd1 = 40 + 80 = 120%
when clearly the best way is:
pmd0 = 80 = 80%
pmd1 = 45 + 40 = 85%
So it's not about which one gets assigned first; as shown, either order
can cause an issue. The problem is that the assignment algorithm is not
designed for changing and overlapping ranges of pmds to assign rxqs to.
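The rounds above can be reproduced with a small self-contained sketch of
the greedy step (illustrative only, not the OVS assignment code); both
pmds are assumed to be on the per-numa rxqs' NUMA, while the 80% rxq is
cross-numa from the other node:

    #include <stdio.h>

    int main(void)
    {
        unsigned load[2] = {0, 0};
        /* Assignment order after the v5 change: per-numa rxqs first
         * (sorted by load), then the cross-numa rxq. */
        unsigned rxq[3] = {45, 40, 80};

        for (int i = 0; i < 3; i++) {
            int lowest = load[0] <= load[1] ? 0 : 1;
            load[lowest] += rxq[i];
            printf("round %d: pmd0=%u%% pmd1=%u%%\n",
                   i + 1, load[0], load[1]);
        }
        /* Round 3 prints pmd0=45% pmd1=120%; the optimum is 80%/85%. */
        return 0;
    }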
To fix this, you would probably need to do all the assignments first as
per v4 and then do another round of checking and possibly moving some
cross-numa=true rxqs. But that relies even further on estimates, which
the cross-numa moves themselves make potentially inaccurate. If you are
writing something to "move" individual rxqs after the initial
assignment, maybe it's better to rethink doing it in the ALB with the
real measured loads and not estimates.
It is worth noting again that, while less flexible, if the rxq load is
distributed across an interface's queues with RSS etc., some pinning of
phy rxqs can allow cross-numa polling and remove any inaccurate
estimates from changing numa.
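For reference, that pinning uses the existing pmd-rxq-affinity option,
e.g. (queue and core ids are made up; core 27 here would sit on the
remote NUMA):

    ovs-vsctl set Interface phy0 other_config:pmd-rxq-affinity="0:3,1:27"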
thanks,
Kevin.
Changes in v4:
- Addressed comments from Kevin Traynor <[email protected]>
Please refer to this thread for an earlier discussion on this topic:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
Documentation/topics/dpdk/pmd.rst | 23 +++++
lib/dpif-netdev.c | 156 +++++++++++++++++++++++-------
tests/pmd.at | 38 ++++++++
vswitchd/vswitch.xml | 20 ++++
4 files changed, 201 insertions(+), 36 deletions(-)