Hello Kevin,
Thanks for your inputs.
In this scenario we have one VM each on NUMA0 and NUMA1 (VM1 is on NUMA0, VM2
is on NUMA1), and the DPDK port is on NUMA1.
Without cross-numa-polling, VM/VHU queue traffic is evenly distributed, based
on load, across the PMDs on their respective NUMA sockets.
However, DPDK traffic is only load balanced across the NUMA1 PMDs, which
results in an aggregate load imbalance in the system (i.e. the NUMA1 PMDs
carry more load than the NUMA0 PMDs).
Please refer to the example below (cross-numa-polling not enabled):
pmd thread numa_id 0 core_id 2:
isolated : false
port: vhu-vm1p1 queue-id: 2 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 4 (enabled) pmd usage: 0 %
overhead: 0 %
pmd thread numa_id 1 core_id 3:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 13 %
port: dpdk0 queue-id: 2 (enabled) pmd usage: 15 %
port: vhu-vm2p1 queue-id: 3 (enabled) pmd usage: 9 %
port: vhu-vm2p1 queue-id: 4 (enabled) pmd usage: 0 %
overhead: 0 %
With cross-numa-polling enabled, the rxqs from the DPDK port are distributed
to both NUMAs, and the 'group' scheduling algorithm then assigns the rxqs to
PMDs based on load.
Please refer to the example below, after cross-numa-polling is enabled on the
dpdk0 port (the relevant commands are summarized after the output):
pmd thread numa_id 0 core_id 2:
isolated : false
port: dpdk0 queue-id: 5 (enabled) pmd usage: 11 %
port: vhu-vm1p1 queue-id: 3 (enabled) pmd usage: 4 %
port: vhu-vm1p1 queue-id: 5 (enabled) pmd usage: 4 %
overhead: 2 %
pmd thread numa_id 1 core_id 3:
isolated : false
port: dpdk0 queue-id: 2 (enabled) pmd usage: 10 %
port: vhu-vm2p1 queue-id: 0 (enabled) pmd usage: 4 %
port: vhu-vm2p1 queue-id: 6 (enabled) pmd usage: 4 %
overhead: 3 %
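For reference, the outputs above are from 'ovs-appctl dpif-netdev/pmd-rxq-show'.
The per-port setting was enabled roughly as below, assuming the knob from the
attached patch is an Interface other_config option named 'cross-numa-polling'
(the option name is per the patch and shown only for illustration):

  ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true
  ovs-appctl dpif-netdev/pmd-rxq-show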
Regards,
Anurag
-----Original Message-----
From: Kevin Traynor <[email protected]>
Sent: Thursday, March 10, 2022 11:02 PM
To: Anurag Agarwal <[email protected]>; Jan Scheurich
<[email protected]>; Wan Junjie <[email protected]>
Cc: [email protected]
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to assign pmd
rxq to all numas
On 04/03/2022 17:57, Anurag Agarwal wrote:
> Hello Kevin,
> I have prepared a patch for "per port cross-numa-polling" and
> attached herewith.
>
> The results are captured in 'cross-numa-results.txt'. We see PMD to RxQ
> assignment evenly balanced across all PMDs with this patch.
>
> Please take a look and let us know your inputs.
>
Hi Anurag,
I think what this is showing is more related to the txqs used for sending to
the VM. As you are allowing the rxqs from the phy port to be handled by more
pmds, and all those rxqs have traffic, more txqs are in turn used for sending
to the VM. The result of using more txqs when sending to the VM in this case
is that the traffic is returned on more rxqs.
Allowing cross-numa does not guarantee that the different pmd cores will poll
rxqs from an interface. At least with group algorithm, the pmds will be
selected purely on load. The right way to ensure that all VM
txqs(/rxqs) are used is to enable the Tx-steering feature [0].
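For reference, Tx-steering is enabled per interface, e.g. (the interface name
here is only illustrative):

  ovs-vsctl set Interface vhu-vm1p1 other_config:tx-steering=hash

In 'hash' mode OVS spreads packets across all of the interface's txqs based on
packet hash, instead of the default 'thread' mode where each pmd thread uses
its own txq.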
So you might be seeing some benefit in this case, but to me it's not the core
use case of cross-numa polling. That is more about allowing the pmds on every
numa to be used when the traffic load is primarily coming from one numa.
Kevin.
[0] https://docs.openvswitch.org/en/latest/topics/userspace-tx-steering/
> Please find some of my inputs inline, in response to your comments.
>
> Regards,
> Anurag
>
> -----Original Message-----
> From: Kevin Traynor <[email protected]>
> Sent: Thursday, February 24, 2022 7:54 PM
> To: Jan Scheurich <[email protected]>; Wan Junjie
> <[email protected]>
> Cc: [email protected]; Anurag Agarwal <[email protected]>
> Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to
> assign pmd rxq to all numas
>
> Hi Jan,
>
> On 17/02/2022 14:21, Jan Scheurich wrote:
>> Hi Kevin,
>>
>>>> We have done extensive benchmarking and found that we get better overall
>>>> PMD load balance and resulting OVS performance when we do not statically
>>>> pin any rx queues and instead let the auto-load-balancing find the optimal
>>>> distribution of phy rx queues over both NUMA nodes to balance an asymmetric
>>>> load of vhu rx queues (polled only on the local NUMA node).
>>>>
>>>> Cross-NUMA polling of vhu rx queues comes with a very high latency cost due
>>>> to cross-NUMA access to volatile virtio ring pointers in every iteration
>>>> (not only when actually copying packets). Cross-NUMA polling of phy rx
>>>> queues doesn't have a similar issue.
>>>>
>>>
>>> I agree that for vhost rxq polling, it always causes a performance
>>> penalty when there is cross-numa polling.
>>>
>>> For polling phy rxq, when phy and vhost are in different numas, I
>>> don't see any additional penalty for cross-numa polling the phy rxq.
>>>
>>> For the case where phy and vhost are both in the same numa, if I
>>> change to poll the phy rxq cross-numa, then I see about a >20% tput
>>> drop for traffic from phy -> vhost. Are you seeing that too?
>>
>> Yes, but the performance drop is mostly due to the extra cost of copying the
>> packets across the UPI bus to the virtio buffers on the other NUMA, not
>> because of polling the phy rxq on the other NUMA.
>>
>
> Just to be clear, phy and vhost are on the same numa in my test. I see the
> drop when polling the phy rxq with a pmd from a different numa.
>
>>>
>>> Also, the fact that a different numa can poll the phy rxq after
>>> every rebalance means that the ability of the auto-load-balancer to
>>> estimate and trigger a rebalance is impacted.
>>
>> Agree, there is some inaccuracy in the estimation of the load a phy rx queue
>> creates when it is moved to another NUMA node. So far we have not seen that
>> as a practical problem.
>>
>>>
>>> It seems like simple pinning some phy rxqs cross-numa would avoid
>>> all the issues above and give most of the benefit of cross-numa polling for
>>> phy rxqs.
>>
>> That is what we have done in the past (for a lack of alternatives). But any
>> static pinning reduces the ability of the auto-load balancer to do its job.
>> Consider the following scenarios:
>>
>> 1. The phy ingress traffic is not evenly distributed by RSS due to lack of
>> entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or
>> MPLSoGRE encapsulated traffic).
>>
>> 2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu
>> ports are all on NUMA 0.
>>
>> In all such scenarios, static pinning of phy rxqs may lead to unnecessarily
>> uneven PMD load and loss of overall capacity.
>>
>
> I agree that static pinning may cause a bottleneck if you have more than one
> rxq pinned on a core. On the flip side, pinning removes uncertainty about the
> ability of OVS to make good assignments and to do ALB.
> [Anurag] Echoing what Jan said, static pinning wouldn't allow rebalancing in
> case the traffic across DPDK and VHU queues is asymmetric. With the
> introduction of per-port cross-numa-polling, the user has one more option in
> the toolbox to allow full auto load balancing without worrying at all about
> the rxq to PMD assignments. This also makes the deployment of OVS much
> simpler. The user now only needs to provide the list of CPUs and enable auto
> load balancing and cross-numa-polling (if necessary). All of the rest is
> handled in software, as shown in the sketch below.
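> For illustration, a minimal sketch of that configuration (the per-port
> 'cross-numa-polling' option name is per the attached patch; the CPU mask
> value is only an example, the rest is existing OVS configuration):
>
>   ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xAA
>   ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
>   ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb="true"
>   ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true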
>
>>>
>>> With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS
>>> could still assign other rxqs to those cores which have pinned
>>> phy rxqs and properly adjust the assignments based on the load from the
>>> pinned rxqs.
>>
>> Yes, sometimes the vhu rxq load is distributed such that it can be used to
>> balance the PMDs, but not always. Sometimes the balance is just better when
>> phy rxqs are not pinned.
>>
>>>
>>> New assignments or auto-load-balance would not change the numa
>>> polling those rxqs, so it would have no impact on ALB or the ability
>>> to assign based on load.
>>
>> In our practical experience the new "group" algorithm for load-based rxq
>> distribution is able to balance the PMD load best when none of the rxqs are
>> pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the
>> prediction error when doing auto-lb dry-runs cannot be significant.
>>
>
> It could definitely be significant in some cases but it depends on a lot of
> factors to know that.
>
>> In our experience we consistently get the best PMD balance and OVS
>> throughput when we give the auto-lb a free hand (no cross-NUMA polling of vhu
>> rxqs, though).
>>
>> BR, Jan
>
> Thanks for sharing your experience with it. My fear with the proposal is that
> someone turns this on and then tells us performance is worse and/or OVS
> assignments/ALB are broken, because it has an impact on their case.
> [Anurag] We have run tests with the per-port cross-numa patch; please find
> the results attached. We have more detailed results available for a 2-core
> and 4-core OVS/PMD resource allocation (i.e. 4 PMDs and 8 PMDs available to
> OVS, respectively). The ALB algorithm was able to load balance and distribute
> rxqs to PMDs evenly for both UDP over VLAN and UDP over VxLAN traffic, and
> also when combined with other features such as security groups.
>
> In terms of limiting possible negative effects,
> - it can be opt-in and recommended only for phy ports
> [Anurag] I believe this might be a reasonable approach. A patch for this is
> attached for your reference.
> - could print a warning when it is enabled
> [Anurag] Might be a reasonable thing to do. There already seems to be some
> logging to warn when an rxq is polled by a non-local NUMA PMD.
> - ALB is currently disabled with cross-numa polling (except a limited
> case) but it's clear you want to remove that restriction too
> [Anurag] Yes. We exercise cross-numa-polling with 'group' scheduling and PMD
> auto-lb enabled today in our solution, and it would be nice to support this
> with OVS master as well.
> - for ALB, a user could increase the improvement threshold to account
> for any reassignments triggered by inaccuracies
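> (For reference, this is the existing pmd-auto-lb-improvement-threshold
> option, which defaults to 25%; an illustrative example:
>   ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold=50 )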
>
>
> There are also some improvements that can be made to the proposed
> method when used with group assignment,
> - we can prefer the local numa where there is no difference between pmd
> cores (e.g. two unused cores available, pick the local numa one)
> - we can flatten the list of pmds, so the best pmd can be selected. This will
> remove issues with RR numa when there are different numbers of pmd cores or
> loads per numa.
> - I wrote an RFC that does these two items; I can post it when(/if!)
> consensus is reached on the broader topic
>
> In summary, it's a trade-off,
>
> With no cross-numa polling (current):
> - won't have any impact on OVS assignment or ALB accuracy
> - there could be a bottleneck on one numa pmds while other numa pmd
> cores are idle and unused
>
> With cross-numa rx pinning (current):
> - will have access to pmd cores on all numas
> - may require more cycles for some traffic paths
> - won't have any impact on OVS assignment or ALB accuracy
> - >1 pinned rxqs per core may cause a bottleneck depending on traffic
>
> With cross-numa interface setting (proposed):
> - will have access to all pmd cores on all numas (i.e. no unused pmd
> cores during highest load)
> - will require more cycles for some traffic paths
> - will impact OVS assignment and ALB accuracy
>
> Anything missing above, or is it a reasonable summary?
>
> [Anurag] Seems like a good summary to me. Thanks Kevin.
>
> thanks,
> Kevin.