Hi Jan,

On 09/03/2022 15:48, Jan Scheurich wrote:
Thanks for sharing your experience with it. My fear with the proposal is that
someone turns this on and then tells us performance is worse and/or OVS
assignments/ALB are broken, because it has an impact on their case.

In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
- could print a warning when it is enabled
- ALB is currently disabled with cross-numa polling (except in a limited
case), but it's clear you want to remove that restriction too
- for ALB, a user could increase the improvement threshold to account for any
reassignments triggered by inaccuracies (example below)
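For example, using the existing knob (the default is 25; the value here is
purely illustrative):

  ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold="50"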

[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) ports in ALB 
"group" mode as an opt-in config option (default off). Based on our 
observations we are not too concerned about the loss of ALB prediction accuracy, but 
increasing the threshold may be a way of taking that into account, if wanted.
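For illustration, the opt-in could look something like the per-interface
setting below (the option name here is just a placeholder; the exact knob is
whatever the patch defines):

  ovs-vsctl set interface dpdk0 other_config:cross-numa-polling=true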


There are also some improvements that can be made to the proposed method
when used with group assignment,
- we can prefer local numa where there is no difference between pmd cores.
(e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so the best pmd overall can be selected. This will
remove issues with RR numa when there are different numbers of pmd cores or loads
per numa.
- I wrote an RFC that does these two items (rough sketch below); I can post
when(/if!) consensus is reached on the broader topic
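To illustrate the selection logic (the struct, field and function names below
are made up for the example; this is not the actual dpif-netdev code or the
RFC):

#include <stddef.h>
#include <stdint.h>

/* One entry per pmd thread, across all numa nodes (the "flat" list). */
struct pmd_candidate {
    unsigned core_id;
    int numa_id;
    uint64_t est_cycles;   /* Estimated pmd load if the rxq is added. */
};

/* Pick the least-loaded pmd from the flat list, using the rxq's local
 * numa only as a tie-breaker between equally loaded pmds. */
static struct pmd_candidate *
select_best_pmd(struct pmd_candidate *pmds, size_t n_pmds, int rxq_numa_id)
{
    struct pmd_candidate *best = NULL;
    size_t i;

    for (i = 0; i < n_pmds; i++) {
        struct pmd_candidate *cur = &pmds[i];

        if (!best
            || cur->est_cycles < best->est_cycles
            || (cur->est_cycles == best->est_cycles
                && cur->numa_id == rxq_numa_id
                && best->numa_id != rxq_numa_id)) {
            best = cur;
        }
    }
    return best;
}

i.e. there is no per-numa round-robin step any more; numa locality only
matters when two pmds are otherwise equal.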

[Jan] In our alternative version of the current upstream "group" ALB [1] we 
already maintained a flat list of PMDs. So we would support that feature. Using 
NUMA-locality as a tie-breaker makes sense.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html


In summary, it's a trade-off,

With no cross-numa polling (current):
- won't have any impact on OVS assignment or ALB accuracy
- there could be a bottleneck on one numa's pmds while the other numa's pmd cores
are idle and unused

With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact on OVS assignment or ALB accuracy
- >1 pinned rxq per core may cause a bottleneck depending on traffic

With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd cores
during highest load)
- will require more cycles for some traffic paths
- will impact OVS assignment and ALB accuracy
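For reference, the cross-numa rx pinning in the second option is what's doable
today with explicit affinity, e.g. (queue and core IDs illustrative, with
cores 3 and 27 on different numas):

  ovs-vsctl set interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:27"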

Anything missing above, or is it a reasonable summary?

I think that is a reasonable summary, albeit I would have characterized the 
third option a bit more positively:
- Gives ALB maximum freedom to balance the load of PMDs across all NUMA nodes (in the 
likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for better utilization of 
free PMD cycles
- Mostly suitable for phy ports due to limited cycle increase for cross-NUMA 
polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios


It's the estimation accuracy during the assignments. That might be seen as part of the ALB dry-run, but it could also be during an actual reconfigure/reassignment itself. e.g. we think pmdx is the lowest-loaded pmd and assign it another rxq, only to find out that a previously assigned rxq requires 30% more cycles after changing numa, so it was a bad selection to add more rxqs. Now there's overload on that pmdx, etc.
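To put rough, purely illustrative numbers on it: an rxq measured at 40% of a pmd on its local numa gets assigned to pmdx on the other numa, which is at 45%, so we predict 85%; if cross-numa polling costs 30% more, that rxq really needs 52% and pmdx lands at ~97%, close to overload.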

OTOH, in some cases you are now also getting to utilize more pmds from other numas, so there is more net compute power. That means less risk of overload on any pmd, but I'm sure people will still try and push it to the limits.

I think we're both on the same page regarding functionality, pros and cons etc. It's just the inaccuracy and the likelihood of problems occurring that we view differently. You are saying you think it's low risk, as that is your experience, while I am a bit more cautious about it.

We will shortly post a new version of our patch [2] for cross-numa polling on selected 
ports, adapted to the current OVS master.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html


I think we should give it a bit more time for more eyes and any discussion at the high level before progressing too much on the code, but as you are talking about it... I mentioned an RFC, so I put it up here [0].

I didn't add the user-enabling part, that's straightforward; I was just working on how to adapt the current rxq scheduling so it doesn't always select a numa first (as it was structured that way), flattening the numa lists and adding a local numa tiebreaker.

[0] https://github.com/kevintraynor/ovs/commits/crossnuma

thanks,
Kevin.

Thanks, Jan


