Hi Jan,

On 17/02/2022 14:21, Jan Scheurich wrote:
Hi Kevin,

We have done extensive benchmarking and found that we get better overall
PMD load balance and resulting OVS performance when we do not statically
pin any rx queues and instead let the auto-load-balancing find the optimal
distribution of phy rx queues over both NUMA nodes to balance an asymmetric
load of vhu rx queues (polled only on the local NUMA node).

Cross-NUMA polling of vhu rx queues comes with a very high latency cost due
to cross-NUMA access to volatile virtio ring pointers in every iteration (not 
only
when actually copying packets). Cross-NUMA polling of phy rx queues doesn't
have a similar issue.


I agree that for vhost rxq polling, it always causes a performance penalty when
there is cross-numa polling.

For polling phy rxq, when phy and vhost are in different numas, I don't see any
additional penalty for cross-numa polling the phy rxq.

For the case where phy and vhost are both in the same numa, if I change to poll
the phy rxq cross-numa, then I see about a >20% throughput drop for traffic
from phy to vhost. Are you seeing that too?

Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.


Just to be clear, phy and vhost are on the same numa in my test. I see the drop when polling the phy rxq with a pmd from a different numa.


Also, the fact that a different numa can poll the phy rxq after every rebalance
means that the ability of the auto-load-balancer to estimate and trigger a
rebalance is impacted.

Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.


It seems like simply pinning some phy rxqs cross-numa would avoid all the
issues above and give most of the benefit of cross-numa polling for phy rxqs.

That is what we have done in the past (for lack of alternatives). But any
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.


I agree that static pinning may cause a bottleneck if you have more than one rxq pinned on a core. On the flip side, pinning removes uncertainty about OVS's ability to make good assignments and about ALB.
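
To make it concrete, that kind of static pinning is configured roughly as below (the interface name, queue ids and core ids are only placeholders for illustration):

    # pin phy rxq 0 to a local-NUMA core and phy rxq 1 to a remote-NUMA core
    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:27"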


With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could
still assign other rxqs to those cores which have pinned phy rxqs and
properly adjust the assignments based on the load from the pinned rxqs.

Yes, sometimes the vhu rxq load is distributed such that it can be used to
balance the PMDs, but not always. Sometimes the balance is just better when phy
rxqs are not pinned.


New assignments or auto-load-balance would not change the numa polling
those rxqs, so it would have no impact on ALB or the ability to assign based
on load.

In our practical experience the new "group" algorithm for load-based rxq 
distribution is able to balance the PMD load best when none of the rxqs are pinned and 
cross-NUMA polling of phy rxqs is enabled. So the effect of the prediction error when 
doing auto-lb dry-runs cannot be significant.


It could definitely be significant in some cases, but whether it is depends on a lot of factors.
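
For reference, the group assignment and isolation options mentioned above are set roughly as follows, assuming an OVS version that has the group assignment code (the interface name is just a placeholder):

    # load-based "group" assignment; allow non-pinned rxqs on cores with pinned rxqs
    ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
    ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
    # optionally pin some phy rxqs
    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3"

The resulting rxq-to-pmd distribution can then be checked with "ovs-appctl dpif-netdev/pmd-rxq-show".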

In our experience we consistently get the best PMD balance and OVS throughput 
when we give the auto-lb a free hand (no cross-NUMA polling of vhu rxqs,
though).

BR, Jan

Thanks for sharing your experience with it. My fear with the proposal is that someone turns this on and then tells us performance is worse and/or OVS assignments/ALB are broken, because it has an impact on their case.

In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
- could print a warning when it is enabled
- ALB is currently disabled with cross-numa polling (except in a limited case), but it's clear you want to remove that restriction too; for ALB, a user could increase the improvement threshold to account for any reassignments triggered by inaccuracies (see the example below)
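
For that last point, raising the threshold would look something like the following (the default is 25%; the value here is only an example):

    # require a larger predicted variance improvement before ALB reassigns rxqs
    ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold=50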


There are also some improvements that can be made to the proposed method when used with group assignment:
- we can prefer the local numa where there is no difference between pmd cores (e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so the best pmd can be selected; this will remove issues with RR numa when there are different numbers of pmd cores or loads per numa
- I wrote an RFC that does these two items, I can post when(/if!) consensus is reached on the broader topic

In summary, it's a trade-off,

With no cross-numa polling (current):
- won't have any impact on OVS assignment or ALB accuracy
- there could be a bottleneck on one numa's pmds while the other numa's pmd cores are idle and unused

With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact on OVS assignment or ALB accuracy
- >1 pinned rxq per core may cause a bottleneck depending on traffic

With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd cores during highest load)
- will require more cycles for some traffic paths
- will have an impact on OVS assignment and ALB accuracy

Anything missing above, or is it a reasonable summary?

thanks,
Kevin.
