Hi Jan,
On 09/03/2022 15:48, Jan Scheurich wrote:
Thanks for sharing your experience with it. My fear with the proposal is that
someone turns this on and then tells us performance is worse and/or OVS
assignments/ALB are broken, because it has an impact on their case.
In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
- could print a warning when it is enabled
- ALB is currently disabled with cross-numa polling (except in a limited
case), but it's clear you want to remove that restriction too
- for ALB, a user could increase the improvement threshold to account for any
reassignments triggered by inaccuracies
[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) ports in ALB
"group" mode as an opt-in config option (default off). Based on our
observations we are not overly concerned about the loss of ALB prediction accuracy, but
increasing the threshold may be a way of taking it into account, if wanted.
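For reference, the knobs discussed above already exist as other_config settings; a rough sketch of how they could be combined (the threshold value is purely illustrative):

```shell
# Existing options: group-based rxq assignment and the auto load balancer.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb="true"
# Raise the ALB improvement threshold (default 25%) so that small
# estimation inaccuracies don't trigger reassignments on their own.
ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold="40"
```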
There are also some improvements that can be made to the proposed method
when used with group assignment,
- we can prefer local numa where there is no difference between pmd cores.
(e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so the best pmd can be selected. This will
remove issues with RR numa when there are different numbers of pmd cores or
loads per numa.
- I wrote an RFC that does these two items; I can post it when(/if!) consensus is
reached on the broader topic
[Jan] In our alternative version of the current upstream "group" ALB [1] we
already maintained a flat list of PMDs. So we would support that feature. Using
NUMA-locality as a tie-breaker makes sense.
[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html
In summary, it's a trade-off,
With no cross-numa polling (current):
- won't have any impact on OVS assignment or ALB accuracy
- there could be a bottleneck on one numa's pmds while the other numa's pmd
cores are idle and unused
With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact on OVS assignment or ALB accuracy
- >1 pinned rxqs per core may cause a bottleneck depending on traffic
With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd cores
during highest load)
- will require more cycles for some traffic paths
- will impact OVS assignment and ALB accuracy
Anything missing above, or is it a reasonable summary?
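For concreteness, the second option above is today's explicit rxq pinning via affinity; a minimal example (the port name, queue IDs, and core IDs are made up, and core 27 is assumed to sit on the remote NUMA node):

```shell
# Pin dpdk0's rx queue 0 to pmd core 3 and rx queue 1 to pmd core 27,
# which may be on a different NUMA node than the port itself.
ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:27"
```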
I think that is a reasonable summary, albeit I would have characterized the
third option a bit more positively:
- Gives ALB maximum freedom to balance load of PMDs on all NUMA nodes (in the
likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for better utilization of
free PMD cycles
- Mostly suitable for phy ports due to limited cycle increase for cross-NUMA
polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios
It's the estimation accuracy during the assignments. That might be
seen as part of the ALB dry run, but it could also be during an actual
reconfigure/reassignment itself, e.g. we think pmdx is the lowest-loaded
pmd and assign it another rxq, only to find out a previously assigned rxq
now requires 30% more cycles after changing numa and it was a bad
selection to add more rxqs. Now there's overload on that pmdx, etc.
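To put purely illustrative numbers on that scenario (the percentages are invented, only the 30% cycle increase comes from the example above):

```shell
# Hypothetical pmdx: dry-run estimate vs. reality after previously
# assigned rxqs become ~30% more expensive due to the NUMA change.
existing=80   # estimated busy cycles (%) of rxqs already on pmdx
new_rxq=15    # estimated busy cycles (%) of the rxq being added
predicted=$(( existing + new_rxq ))
actual=$(( existing * 130 / 100 + new_rxq ))
echo "predicted ${predicted}%, actual ${actual}%"   # 95% predicted, 119% actual
```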
OTOH, in some cases you are now also getting to utilize more pmds from
other numas, so there is more net compute power. That means less risk of
overload on any pmd, but I'm sure people will still try and push it to
the limits.
I think we're both on the same page regarding functionality, pros and
cons etc. It's just the inaccuracy and likelihood of problems occurring
where we are viewing differently. You are saying you think it's low risk
as that is your experience, while I am a bit more cautious about it.
We will post a new version of our patch [2] for cross-numa polling on selected
ports adapted to the current OVS master shortly.
[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
I think we should give a bit more time for more eyes and any discussion
at the high level before progressing too much on the code, but as you
are talking about it... I mentioned an RFC, so I put it up here [0].
I didn't add the user-enabling part, as that's straightforward; I was just
working on how to adapt the current rxq scheduling so it doesn't always
select a numa first (as that's how it was structured), flattening the numa
lists and adding a local-numa tiebreaker.
[0] https://github.com/kevintraynor/ovs/commits/crossnuma
thanks,
Kevin.
Thanks, Jan
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev