Hello Kevin,
I have prepared a patch for "per-port cross-numa-polling" and attached it
here.
The results are captured in 'cross-numa-results.txt'. With this patch we see
the rxq-to-PMD assignment evenly balanced across all PMDs.
Please take a look and share your feedback.
My responses to your comments are inline below.
Regards,
Anurag
-----Original Message-----
From: Kevin Traynor <[email protected]>
Sent: Thursday, February 24, 2022 7:54 PM
To: Jan Scheurich <[email protected]>; Wan Junjie
<[email protected]>
Cc: [email protected]; Anurag Agarwal <[email protected]>
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to assign pmd
rxq to all numas
Hi Jan,
On 17/02/2022 14:21, Jan Scheurich wrote:
> Hi Kevin,
>
>>> We have done extensive benchmarking and found that we get better
>>> overall
>> PMD load balance and resulting OVS performance when we do not
>> statically pin any rx queues and instead let the auto-load-balancing
>> find the optimal distribution of phy rx queues over both NUMA nodes
>> to balance an asymmetric load of vhu rx queues (polled only on the local
>> NUMA node).
>>>
>>> Cross-NUMA polling of vhu rx queues comes with a very high latency
>>> cost due
>> to cross-NUMA access to volatile virtio ring pointers in every
>> iteration (not only when actually copying packets). Cross-NUMA
>> polling of phy rx queues doesn't have a similar issue.
>>>
>>
>> I agree that for vhost rxq polling, it always causes a performance
>> penalty when there is cross-numa polling.
>>
>> For polling phy rxq, when phy and vhost are in different numas, I
>> don't see any additional penalty for cross-numa polling the phy rxq.
>>
>> For the case where phy and vhost are both in the same numa, if I
>> change to poll the phy rxq cross-numa, then I see about a >20% tput
>> drop for traffic from phy -> vhost. Are you seeing that too?
>
> Yes, but the performance drop is mostly due to the extra cost of copying the
> packets across the UPI bus to the virtio buffers on the other NUMA, not
> because of polling the phy rxq on the other NUMA.
>
Just to be clear, phy and vhost are on the same numa in my test. I see the drop
when polling the phy rxq with a pmd from a different numa.
>>
>> Also, the fact that a different numa can poll the phy rxq after every
>> rebalance means that the ability of the auto-load-balancer to
>> estimate and trigger a rebalance is impacted.
>
> Agree, there is some inaccuracy in the estimation of the load a phy rx queue
> creates when it is moved to another NUMA node. So far we have not seen that
> as a practical problem.
>
>>
>> It seems like simply pinning some phy rxqs cross-numa would avoid all
>> the issues above and give most of the benefit of cross-numa polling for phy
>> rxqs.
>
> That is what we have done in the past (for lack of alternatives). But any
> static pinning reduces the ability of the auto-load balancer to do its job.
> Consider the following scenarios:
>
> 1. The phy ingress traffic is not evenly distributed by RSS due to lack of
> entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or
> MPLSoGRE encapsulated traffic).
>
> 2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu
> ports are all on NUMA 0.
>
> In all such scenarios, static pinning of phy rxqs may lead to unnecessarily
> uneven PMD load and loss of overall capacity.
>
I agree that static pinning may cause a bottleneck if you have more than one rxq
pinned on a core. On the flip side, pinning removes uncertainty about the
ability of OVS to make good assignments and ALB.
[Anurag] Echoing what Jan said, static pinning wouldn't allow rebalancing in
case the traffic across DPDK and VHU queues is asymmetric. With the
introduction of per-port cross-numa-polling, the user has one more option in
the tool box: allow full auto load balancing without worrying at all about
rxq-to-PMD assignments. This also makes OVS deployment much simpler. The user
now only needs to provide the list of CPUs and enable auto-lb and
cross-numa-polling (where necessary); the rest is handled in software.
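For reference, a minimal configuration sketch for this mode (the per-port
option name is taken from the attached patch; the pmd-cpu-mask below matches
the PMD cores used in the attached results, so adjust it for your system):

  # PMD cores 2-5 and 18-21, group assignment and auto load balancing
  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3C003C
  ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
  ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb="true"
  # opt the phy port in to cross-NUMA polling (option added by this patch)
  ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true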
>>
>> With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS
>> could still assign other rxqs to those cores which have pinned
>> phy rxqs and properly adjust the assignments based on the load from the
>> pinned rxqs.
>
> Yes, sometimes the vhu rxq load is distributed such that it can be used to
> balance the PMD, but not always. Sometimes the balance is just better when
> phy rxqs are not pinned.
>
>>
>> New assignments or auto-load-balance would not change the numa
>> polling those rxqs, so it would have no impact on ALB or the ability
>> to assign based on load.
>
> In our practical experience the new "group" algorithm for load-based rxq
> distribution is able to balance the PMD load best when none of the rxqs are
> pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the
> prediction error when doing auto-lb dry-runs cannot be significant.
>
It could definitely be significant in some cases but it depends on a lot of
factors to know that.
> In our experience we consistently get the best PMD balance and OVS throughput
> when we give the auto-lb free hands (no cross-NUMA polling of vhu rxqs,
> though).
>
> BR, Jan
Thanks for sharing your experience with it. My fear with the proposal is that
someone turns this on and then tells us performance is worse and/or OVS
assignments/ALB are broken, because it has an impact on their case.
[Anurag] We have run tests with the per-port cross-numa patch; please find the
results attached. We have more detailed results available for 2-core and
4-core OVS/PMD resource allocations (i.e. 4 PMDs and 8 PMDs available to OVS,
respectively). The ALB algorithm was able to load-balance and distribute rxqs
to PMDs evenly for both UDP over VLAN and UDP over VxLAN traffic, and also
when combined with other features such as security groups.
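For reference, the per-PMD rxq distributions in the attached results were
collected with the standard command:

  ovs-appctl dpif-netdev/pmd-rxq-show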
In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
[Anurag] I believe this might be a reasonable approach. A patch for this is
attached for your reference.
- could print a warning when it is enabled
[Anurag] That might be a reasonable thing to do. There already seems to be some
logging that warns when an rxq is polled by a non-local NUMA PMD.
- ALB is currently disabled with cross-numa polling (except a limited
case) but it's clear you want to remove that restriction too
[Anurag] Yes. In our solution today we exercise cross-numa-polling with 'group'
scheduling and PMD auto-lb enabled, and it would be nice to support this on OVS
master as well.
- for ALB, a user could increase the improvement threshold to account for any
reassignments triggered by inaccuracies
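e.g. something like (using the existing ALB threshold knob):

  # require a larger predicted variance improvement before ALB reassigns
  ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold=50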
There are also some improvements that can be made to the proposed method when
used with group assignment,
- we can prefer local numa where there is no difference between pmd cores.
(e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so best pmd can be selected. This will
remove issues with RR numa when there are different numbers of pmd cores or loads
per numa.
- I wrote an RFC that does these two items, I can post when(/if!) consensus is
reached on the broader topic
In summary, it's a trade-off,
With no cross-numa polling (current):
- won't have any impact on OVS assignment or ALB accuracy
- there could be a bottleneck on one numa pmds while other numa pmd cores are
idle and unused
With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact on OVS assignment or ALB accuracy
- >1 pinned rxqs per core may cause a bottleneck depending on traffic
With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd cores
during highest load)
- will require more cycles for some traffic paths
- will impact OVS assignment and ALB accuracy
Anything missing above, or is it a reasonable summary?
[Anurag] Seems like a good summary to me. Thanks Kevin.
thanks,
Kevin.
Scenario: UDP over VLAN, UDP over VxLAN
Packet Size: 64 Bytes
2 VMs: 1 on NUMA0 + 1 on NUMA1
Flows: 10k
CPU LAYOUT
----------
Socket 0 Socket 1
-------- --------
Core 0 [0, 16] [1, 17]
Core 1 [2, 18] [3, 19]
Core 2 [4, 20] [5, 21]
Core 3 [6, 22] [7, 23]
Core 4 [8, 24] [9, 25]
Core 5 [10, 26] [11, 27]
Core 6 [12, 28] [13, 29]
Core 7 [14, 30] [15, 31]
PMDs : 2,3,4,5,18,19,20,21
VM cpuset : VM1 (6,8,10,12,22,24,26,28) , VM2 (7,9,11,13,23,25,27,29)
1) Software: OVS 2.16 + DPDK 20.11.4 (cross-numa-polling not supported)
------------------------------------
1.1) UDP over VLAN
--------------------
pmd thread numa_id 0 core_id 2:
isolated : false
port: vhu-vm1p1 queue-id: 1 (enabled) pmd usage: 9 %
port: vhu-vm1p1 queue-id: 4 (enabled) pmd usage: 0 %
overhead: 1 %
pmd thread numa_id 1 core_id 3:
isolated : false
port: dpdk0 queue-id: 3 (enabled) pmd usage: 6 %
port: dpdk0 queue-id: 6 (enabled) pmd usage: 7 %
port: vhu-vm2p1 queue-id: 0 (enabled) pmd usage: 5 %
port: vhu-vm2p1 queue-id: 4 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 0 core_id 4:
isolated : false
port: vhu-vm1p1 queue-id: 0 (enabled) pmd usage: 8 %
port: vhu-vm1p1 queue-id: 5 (enabled) pmd usage: 0 %
overhead: 2 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 6 %
port: dpdk0 queue-id: 4 (enabled) pmd usage: 6 %
port: vhu-vm2p1 queue-id: 1 (enabled) pmd usage: 5 %
port: vhu-vm2p1 queue-id: 5 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 0 core_id 18:
isolated : false
port: vhu-vm1p1 queue-id: 3 (enabled) pmd usage: 8 %
port: vhu-vm1p1 queue-id: 6 (enabled) pmd usage: 0 %
overhead: 2 %
pmd thread numa_id 1 core_id 19:
isolated : false
port: dpdk0 queue-id: 1 (enabled) pmd usage: 6 %
port: dpdk0 queue-id: 5 (enabled) pmd usage: 6 %
port: vhu-vm2p1 queue-id: 3 (enabled) pmd usage: 6 %
port: vhu-vm2p1 queue-id: 6 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 0 core_id 20:
isolated : false
port: vhu-vm1p1 queue-id: 2 (enabled) pmd usage: 8 %
port: vhu-vm1p1 queue-id: 7 (enabled) pmd usage: 0 %
overhead: 2 %
pmd thread numa_id 1 core_id 21:
isolated : false
port: dpdk0 queue-id: 2 (enabled) pmd usage: 6 %
port: dpdk0 queue-id: 7 (enabled) pmd usage: 6 %
port: vhu-vm2p1 queue-id: 2 (enabled) pmd usage: 5 %
port: vhu-vm2p1 queue-id: 7 (enabled) pmd usage: 0 %
overhead: 0 %
1.2) UDP over VxLAN
-------------------
pmd thread numa_id 0 core_id 2:
isolated : false
port: vhu-vm1p1 queue-id: 2 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 4 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 1 core_id 3:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 13 %
port: dpdk0 queue-id: 2 (enabled) pmd usage: 15 %
port: vhu-vm2p1 queue-id: 3 (enabled) pmd usage: 9 %
port: vhu-vm2p1 queue-id: 4 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 0 core_id 4:
isolated : false
port: vhu-vm1p1 queue-id: 3 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 5 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: dpdk0 queue-id: 1 (enabled) pmd usage: 13 %
port: dpdk0 queue-id: 3 (enabled) pmd usage: 14 %
port: vhu-vm2p1 queue-id: 2 (enabled) pmd usage: 9 %
port: vhu-vm2p1 queue-id: 5 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 0 core_id 18:
isolated : false
port: vhu-vm1p1 queue-id: 1 (enabled) pmd usage: 10 %
port: vhu-vm1p1 queue-id: 6 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 1 core_id 19:
isolated : false
port: dpdk0 queue-id: 4 (enabled) pmd usage: 13 %
port: dpdk0 queue-id: 6 (enabled) pmd usage: 13 %
port: vhu-vm2p1 queue-id: 0 (enabled) pmd usage: 9 %
port: vhu-vm2p1 queue-id: 6 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 0 core_id 20:
isolated : false
port: vhu-vm1p1 queue-id: 0 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 7 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 1 core_id 21:
isolated : false
port: dpdk0 queue-id: 5 (enabled) pmd usage: 13 %
port: dpdk0 queue-id: 7 (enabled) pmd usage: 13 %
port: vhu-vm2p1 queue-id: 1 (enabled) pmd usage: 10 %
port: vhu-vm2p1 queue-id: 7 (enabled) pmd usage: 0 %
overhead: 0 %
2) Software: OVS 2.16 + DPDK 20.11.4 (cross-numa-polling patch applied)
------------------------------------
2.1) UDP over VLAN
--------------------
pmd thread numa_id 0 core_id 2:
isolated : false
port: dpdk0 queue-id: 5 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 3 (enabled) pmd usage: 4 %
port: vhu-vm1p1 queue-id: 5 (enabled) pmd usage: 4 %
overhead: 2 %
pmd thread numa_id 1 core_id 3:
isolated : false
port: dpdk0 queue-id: 2 (enabled) pmd usage: 10 %
port: vhu-vm2p1 queue-id: 0 (enabled) pmd usage: 4 %
port: vhu-vm2p1 queue-id: 6 (enabled) pmd usage: 4 %
overhead: 3 %
pmd thread numa_id 0 core_id 4:
isolated : false
port: dpdk0 queue-id: 6 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 0 (enabled) pmd usage: 4 %
port: vhu-vm1p1 queue-id: 6 (enabled) pmd usage: 4 %
overhead: 3 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: dpdk0 queue-id: 3 (enabled) pmd usage: 10 %
port: vhu-vm2p1 queue-id: 2 (enabled) pmd usage: 4 %
port: vhu-vm2p1 queue-id: 4 (enabled) pmd usage: 4 %
overhead: 2 %
pmd thread numa_id 0 core_id 18:
isolated : false
port: dpdk0 queue-id: 4 (enabled) pmd usage: 10 %
port: vhu-vm1p1 queue-id: 2 (enabled) pmd usage: 4 %
port: vhu-vm1p1 queue-id: 4 (enabled) pmd usage: 4 %
overhead: 5 %
pmd thread numa_id 1 core_id 19:
isolated : false
port: dpdk0 queue-id: 1 (enabled) pmd usage: 10 %
port: vhu-vm2p1 queue-id: 3 (enabled) pmd usage: 4 %
port: vhu-vm2p1 queue-id: 5 (enabled) pmd usage: 4 %
overhead: 2 %
pmd thread numa_id 0 core_id 20:
isolated : false
port: dpdk0 queue-id: 7 (enabled) pmd usage: 10 %
port: vhu-vm1p1 queue-id: 1 (enabled) pmd usage: 4 %
port: vhu-vm1p1 queue-id: 7 (enabled) pmd usage: 4 %
overhead: 4 %
pmd thread numa_id 1 core_id 21:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 10 %
port: vhu-vm2p1 queue-id: 1 (enabled) pmd usage: 4 %
port: vhu-vm2p1 queue-id: 7 (enabled) pmd usage: 4 %
overhead: 2 %
2.2) UDP over VxLAN
--------------------
pmd thread numa_id 0 core_id 2:
isolated : false
port: dpdk0 queue-id: 6 (enabled) pmd usage: 20 %
port: vhu-vm1p1 queue-id: 5 (enabled) pmd usage: 7 %
port: vhu-vm1p1 queue-id: 6 (enabled) pmd usage: 7 %
overhead: 3 %
pmd thread numa_id 1 core_id 3:
isolated : false
port: dpdk0 queue-id: 2 (enabled) pmd usage: 19 %
port: vhu-vm2p1 queue-id: 4 (enabled) pmd usage: 7 %
port: vhu-vm2p1 queue-id: 5 (enabled) pmd usage: 7 %
overhead: 4 %
pmd thread numa_id 0 core_id 4:
isolated : false
port: dpdk0 queue-id: 4 (enabled) pmd usage: 20 %
port: vhu-vm1p1 queue-id: 2 (enabled) pmd usage: 7 %
port: vhu-vm1p1 queue-id: 4 (enabled) pmd usage: 7 %
overhead: 4 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: dpdk0 queue-id: 3 (enabled) pmd usage: 19 %
port: vhu-vm2p1 queue-id: 1 (enabled) pmd usage: 7 %
port: vhu-vm2p1 queue-id: 6 (enabled) pmd usage: 7 %
overhead: 3 %
pmd thread numa_id 0 core_id 18:
isolated : false
port: dpdk0 queue-id: 1 (enabled) pmd usage: 19 %
port: vhu-vm1p1 queue-id: 0 (enabled) pmd usage: 6 %
port: vhu-vm1p1 queue-id: 3 (enabled) pmd usage: 6 %
overhead: 4 %
pmd thread numa_id 1 core_id 19:
isolated : false
port: dpdk0 queue-id: 5 (enabled) pmd usage: 19 %
port: vhu-vm2p1 queue-id: 2 (enabled) pmd usage: 7 %
port: vhu-vm2p1 queue-id: 3 (enabled) pmd usage: 6 %
overhead: 3 %
pmd thread numa_id 0 core_id 20:
isolated : false
port: dpdk0 queue-id: 7 (enabled) pmd usage: 19 %
port: vhu-vm1p1 queue-id: 1 (enabled) pmd usage: 6 %
port: vhu-vm1p1 queue-id: 7 (enabled) pmd usage: 6 %
overhead: 4 %
pmd thread numa_id 1 core_id 21:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 19 %
port: vhu-vm2p1 queue-id: 0 (enabled) pmd usage: 7 %
port: vhu-vm2p1 queue-id: 7 (enabled) pmd usage: 7 %
overhead: 3 %
eaaanug@IN-00230946:/mnt/c/users/eaaanug/code/ovs-upstream/ovs-2.16.2$ git diff
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d6bee2a5a..82794fd12 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -399,6 +399,7 @@ struct dp_netdev_port {
bool emc_enabled; /* If true EMC will be used. */
char *type; /* Port type as requested by user. */
char *rxq_affinity_list; /* Requested affinity of rx queues. */
+ bool cross_numa_polling; /* If true, cross-NUMA polling is allowed. */
};
static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
@@ -4436,6 +4437,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
int error = 0;
const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
+ bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
ovs_mutex_lock(&dp->port_mutex);
error = get_port_by_number(dp, port_no, &port);
@@ -4443,6 +4445,11 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
goto unlock;
}
+ if (cross_numa_polling != port->cross_numa_polling) {
+ port->cross_numa_polling = cross_numa_polling;
+ dp_netdev_request_reconfigure(dp);
+ }
+
if (emc_enabled != port->emc_enabled) {
struct dp_netdev_pmd_thread *pmd;
struct ds ds = DS_EMPTY_INITIALIZER;
@@ -5257,7 +5264,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
{
struct dp_netdev_port *port;
struct dp_netdev_rxq **rxqs = NULL;
- struct sched_numa *last_cross_numa;
+ struct sched_numa *next_numa = NULL;
unsigned n_rxqs = 0;
bool start_logged = false;
size_t n_numa;
@@ -5341,7 +5348,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
}
- last_cross_numa = NULL;
+ next_numa = NULL;
n_numa = sched_numa_list_count(numa_list);
for (unsigned i = 0; i < n_rxqs; i++) {
struct dp_netdev_rxq *rxq = rxqs[i];
@@ -5361,20 +5368,25 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
/* Select the numa that should be used for this rxq. */
numa_id = netdev_get_numa_id(rxq->port->netdev);
- numa = sched_numa_list_lookup(numa_list, numa_id);
+
+ if (!(rxqs[i]->port->cross_numa_polling)) {
+ /* Try to find a local pmd. */
+ numa = sched_numa_list_lookup(numa_list, numa_id);
+ } else {
+ /* Allow polling by any pmd. */
+ numa = NULL;
+ }
/* Check if numa has no PMDs or no non-isolated PMDs. */
if (!numa || !sched_numa_noniso_pmd_count(numa)) {
/* Unable to use this numa to find a PMD. */
- numa = NULL;
/* Find any numa with available PMDs. */
for (int j = 0; j < n_numa; j++) {
- numa = sched_numa_list_next(numa_list, last_cross_numa);
- if (sched_numa_noniso_pmd_count(numa)) {
+ next_numa = sched_numa_list_next(numa_list, next_numa);
+ if (sched_numa_noniso_pmd_count(next_numa)) {
+ numa = next_numa;
break;
}
- last_cross_numa = numa;
- numa = NULL;
}
}
@@ -5400,6 +5412,8 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
sched_pmd_add_rxq(sched_pmd, rxq, proc_cycles);
}
}
+
+
if (!sched_pmd) {
VLOG(level == VLL_DBG ? level : VLL_WARN,
"No non-isolated pmd on any numa available for "_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev