Hi Kevin,

Thanks for your reply.
Sorry about forgetting the cc; reposting it.

On Sat, Feb 12, 2022 at 1:09 AM Kevin Traynor <[email protected]> wrote:
>
> Hi Wan Junjie,
>
> On 27/01/2022 11:43, Wan Junjie wrote:
> > When assigning an rxq to a pmd, the rxq will only be assigned to
> > the numa it belongs to. The exception is when that numa has no
> > non-isolated pmd.
> >
> > For example, we have one vm port (vhu) with all its rxqs in numa 1,
> > while one phy port (p0) in numa 0 has all its rxqs in numa 0.
> > We have four pmds in two numas.
> > See the table below; parameters are listed as (port, queue, numa, core):
> > p0  0 0 20
> > p0  1 0 21
> > p0  2 0 20
> > p0  3 0 21
> > vhu 0 1 40
> > vhu 1 1 41
> > vhu 2 1 40
> > vhu 3 1 41
> > In this scenario, ovs-dpdk underperforms a setup where all of p0's
> > queues are pinned to the four pmds using pmd-rxq-affinity. With
> > pmd-rxq-isolate=false, we can make vhu get polled on numa 1:
> > p0  0 0 20
> > p0  1 1 40
> > p0  2 0 21
> > p0  3 1 41
> > vhu 0 1 40
> > vhu 1 1 41
> > vhu 2 1 40
> > vhu 3 1 41
> > Then we found that at really high throughput, pmds 40 and 41
> > reach 100% cycles busy, while pmds 20 and 21 reach 70% cycles
> > busy. The unbalanced rxqs on these pmds became the bottleneck.
> >
>
> OTOH, you could also look at it that you are contributing to that
> overload by pinning p0 to those cores. So the partial pinning you are
> doing here is not a good solution for your config. If you had
> continued to pin all the queues, then you would have had a better
> solution.
>

A host usually has only one phy port taken by ovs, and the numa that
port uses can be determined at initialization. On the other hand, we can
have multiple vhu ports from several vms, and the numa of those vhu
ports can be arbitrary, since the vms may use huge pages from different
numas. Partially pinning only p0 across all pmd cores can benefit ovs
performance, as p0 is then polled by more threads. Otherwise, the
receiving-side ovs can only achieve half of the throughput.
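For concreteness, the fully pinned setup discussed above could be expressed with pmd-rxq-affinity roughly as in this sketch; the queue:core pairs below are just the values from the example tables in the quoted commit message, not a recommendation:

```shell
# Pin p0's four rxqs across the four pmd cores (queue:core pairs).
# Core IDs 20/21/40/41 are the ones from the example tables above.
ovs-vsctl set Interface p0 \
    other_config:pmd-rxq-affinity="0:20,1:40,2:21,3:41"
```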
The main idea of pinning p0 to all cores is to get it polled more, even
if vhu did not get pinned cross-numa. This setting is good when the
traffic is not that high; only when the traffic goes really high does
the bottleneck appear, and we did see it happen. This patch provides an
option for high traffic on all threads.

> > pmd-rxq-numa=false would ignore the rxq's numa attribute when
> > assigning an rxq to pmds. This could make full use of the cores from
> > different numas while we have limited core resources for pmds.
> >
>
> Yes, with this config and this traffic, it is beneficial. The challenge
> is trying to also make it not cause performance regressions for other
> configs. Or at least having some control about that.
>
> With the code below (and the user enabling) there is no preference for
> local-numa polling, so it can be always local-numa or cross-numa. Have
> you tested the performance drop that will occur if cross-numa polling is
> selected instead of local-numa for different cases? It can be quite
> large, like 40%.
>

Yes, cross-numa polling can hurt throughput. I ran some tests placing
the phy port, the vhu port, and the pmds on two numas. See the
combinations and results below (':' is the delimiter between numas):

p0 : PMD + vhu | PMD + p0 : vhu | p0 + vhu : PMD | PMD + p0 + vhu
      11G      |      9.7G      |      8.2G      |     14.2G

The DUT is the receiver side. This data comes from iperf with one flow.
With multiple flows, throughput reaches 22.8G in all cases, the line
rate limit (25G nic). For the PPS test, the results did not show large
gaps; all were around 5 Mpps. In another test, with 100G * 2 bonding,
the cross-numa multiple-flow test gave the same result as local-numa.
In a real scenario, the multiple-flow results and the PPS data are
probably much more important.

Another thing needs to be mentioned. From the tests we can see that when
the vm and p0 are on the same numa, performance is better than when they
are on different numas.
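To make the knob under discussion concrete, this is how the proposed option would be toggled (option name as defined in the patch below); a sketch, assuming the patch is applied:

```shell
# Ignore rxq numa locality when assigning rxqs to pmds (proposed option).
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-numa=false

# Inspect the resulting rxq-to-pmd distribution.
ovs-appctl dpif-netdev/pmd-rxq-show

# Restore the default numa-aware assignment.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-numa=true
```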
So two vms on different numas could see different performance data. This
is inevitable.

> Another issue is that if you allow polling to switch numa commonly you
> also make estimates for assignments less reliable. I don't see any good
> way around this, but there might still be net benefit to lose some
> accuracy in order to be able to utilize all the cores. Especially if you
> consider worst cases where one numa's pmds are overloaded and another
> numa's pmds are idle.
>
> Btw, this patch is similar in functionality to the one posted by Anurag
> [0] and there was also some discussion about this approach here [1].
>

Thanks for pointing this out. IMO, setting cross-numa polling per
interface would be good for a phy port but not for vhu, since vhu ports
can be destroyed and created relatively frequently. But yes, the main
idea is the same.

> Another option is to only use cross-numa cores when local-numa ones are
> overloaded. That way we can get the benefit from local-numa polling
> for performance and also utilize cross-numa under overload conditions.
> Though it comes with the expense of complexity and defining what
> overloaded is etc. I have an RFC to do that; I will post it for
> discussion once I clean it up.
>
> thanks,
> Kevin.
>
> [0] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385695.html
>

It would be great if they could be polled in that order.

Thank you,
Wan Junjie

> > Signed-off-by: Wan Junjie <[email protected]>
> > ---
> >  lib/dpif-netdev.c    |  30 +++++++++---
> >  tests/pmd.at         | 106 +++++++++++++++++++++++++++++++++++++++++++
> >  vswitchd/vswitch.xml |  19 +++++++-
> >  3 files changed, 148 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index e28e0b554..6b54a3a4c 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -297,6 +297,7 @@ struct dp_netdev {
> >      struct ovs_mutex tx_qid_pool_mutex;
> >      /* Rxq to pmd assignment type. */
> >      enum sched_assignment_type pmd_rxq_assign_type;
> > +    bool pmd_rxq_assign_numa;
> >      bool pmd_iso;
> >
> >      /* Protects the access of the 'struct dp_netdev_pmd_thread'
> > @@ -1844,6 +1845,7 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
> >
> >      cmap_init(&dp->poll_threads);
> >      dp->pmd_rxq_assign_type = SCHED_CYCLES;
> > +    dp->pmd_rxq_assign_numa = true;
> >
> >      ovs_mutex_init(&dp->tx_qid_pool_mutex);
> >      /* We need 1 Tx queue for each possible core + 1 for non-PMD threads. */
> > @@ -4867,6 +4869,14 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
> >          dp_netdev_request_reconfigure(dp);
> >      }
> >
> > +    bool rxq_assign_numa = smap_get_bool(other_config, "pmd-rxq-numa", true);
> > +    if (dp->pmd_rxq_assign_numa != rxq_assign_numa) {
> > +        dp->pmd_rxq_assign_numa = rxq_assign_numa;
> > +        VLOG_INFO("Rxq to PMD numa assignment changed to: \'%s\'.",
> > +                  rxq_assign_numa ? "numa enabled" : "numa disabled");
> > +        dp_netdev_request_reconfigure(dp);
> > +    }
> > +
> >      struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> >
> >      rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebal-interval",
> > @@ -5869,6 +5879,7 @@ static void
> >  sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >                           struct dp_netdev *dp,
> >                           enum sched_assignment_type algo,
> > +                         bool assign_numa,
> >                           enum vlog_level level)
> >      OVS_REQ_RDLOCK(dp->port_rwlock)
> >  {
> > @@ -5980,14 +5991,19 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >          numa_id = netdev_get_numa_id(rxq->port->netdev);
> >          numa = sched_numa_list_lookup(numa_list, numa_id);
> >
> > -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> > -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > +        /* Check if numa has no PMDs or no non-isolated PMDs,
> > +           or if we should ignore the rxq numa attribute. */
> > +        if (!numa || !sched_numa_noniso_pmd_count(numa) || !assign_numa) {
> >              /* Unable to use this numa to find a PMD. */
> >              numa = NULL;
> > -            /* Find any numa with available PMDs. */
> > +            /* If we ignore the rxq numa attribute, round-robin across
> > +               numas; else find any numa with available PMDs. */
> >              for (int j = 0; j < n_numa; j++) {
> >                  numa = sched_numa_list_next(numa_list, last_cross_numa);
> >                  if (sched_numa_noniso_pmd_count(numa)) {
> > +                    if (!assign_numa) {
> > +                        last_cross_numa = numa;
> > +                    }
> >                      break;
> >                  }
> >                  last_cross_numa = numa;
> > @@ -5996,7 +6012,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >          }
> >
> >          if (numa) {
> > -            if (numa->numa_id != numa_id) {
> > +            if (numa->numa_id != numa_id && assign_numa) {
> >                  VLOG(level, "There's no available (non-isolated) pmd thread "
> >                              "on numa node %d. Port \'%s\' rx queue %d will "
> >                              "be assigned to a pmd on numa node %d. "
> > @@ -6036,9 +6052,10 @@ rxq_scheduling(struct dp_netdev *dp)
> >  {
> >      struct sched_numa_list numa_list;
> >      enum sched_assignment_type algo = dp->pmd_rxq_assign_type;
> > +    bool assign_numa = dp->pmd_rxq_assign_numa;
> >
> >      sched_numa_list_populate(&numa_list, dp);
> > -    sched_numa_list_schedule(&numa_list, dp, algo, VLL_INFO);
> > +    sched_numa_list_schedule(&numa_list, dp, algo, assign_numa, VLL_INFO);
> >      sched_numa_list_put_in_place(&numa_list);
> >
> >      sched_numa_list_free_entries(&numa_list);
> > @@ -6164,7 +6181,8 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> >      /* Populate estimated assignments. */
> >      sched_numa_list_populate(&numa_list_est, dp);
> >      sched_numa_list_schedule(&numa_list_est, dp,
> > -                             dp->pmd_rxq_assign_type, VLL_DBG);
> > +                             dp->pmd_rxq_assign_type,
> > +                             dp->pmd_rxq_assign_numa, VLL_DBG);
> >
> >      /* Check if cross-numa polling, there is only one numa with PMDs. */
> >      if (!sched_numa_list_cross_numa_polling(&numa_list_est) ||
> > diff --git a/tests/pmd.at b/tests/pmd.at
> > index a2f9d34a2..4ef5b6886 100644
> > --- a/tests/pmd.at
> > +++ b/tests/pmd.at
> > @@ -317,6 +317,48 @@ pmd thread numa_id 0 core_id 2:
> >    isolated : false
> >  ])
> >
> > +# disable pmd-rxq-numa for roundrobin
> > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=false])
> > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Rxq to PMD numa assignment changed to: 'numa disabled'"])
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show], [0], [dnl
> > +pmd thread numa_id 1 core_id 1:
> > +  isolated : false
> > +  port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 3 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 5 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 7 (enabled) pmd usage: NOT AVAIL
> > +  overhead: NOT AVAIL
> > +pmd thread numa_id 0 core_id 2:
> > +  isolated : false
> > +  port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 4 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 6 (enabled) pmd usage: NOT AVAIL
> > +  overhead: NOT AVAIL
> > +])
> > +
> > +# enable pmd-rxq-numa again for roundrobin
> > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=true])
> > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Rxq to PMD numa assignment changed to: 'numa enabled'"])
> > +
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show], [0], [dnl
> > +pmd thread numa_id 1 core_id 1:
> > +  isolated : false
> > +  port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 3 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 4 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 5 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 6 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 7 (enabled) pmd usage: NOT AVAIL
> > +  overhead: NOT AVAIL
> > +pmd thread numa_id 0 core_id 2:
> > +  isolated : false
> > +])
> > +
> >  TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> >  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
> >  OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Performing pmd to rx queue assignment using cycles algorithm"])
> > @@ -337,6 +379,50 @@ pmd thread numa_id 0 core_id 2:
> >    isolated : false
> >  ])
> >
> > +
> > +# disable pmd-rxq-numa for cycles
> > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=false])
> > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Rxq to PMD numa assignment changed to: 'numa disabled'"])
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show], [0], [dnl
> > +pmd thread numa_id 1 core_id 1:
> > +  isolated : false
> > +  port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 3 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 5 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 7 (enabled) pmd usage: NOT AVAIL
> > +  overhead: NOT AVAIL
> > +pmd thread numa_id 0 core_id 2:
> > +  isolated : false
> > +  port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 4 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 6 (enabled) pmd usage: NOT AVAIL
> > +  overhead: NOT AVAIL
> > +])
> > +
> > +# enable pmd-rxq-numa again for cycles
> > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=true])
> > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Rxq to PMD numa assignment changed to: 'numa enabled'"])
> > +
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show], [0], [dnl
> > +pmd thread numa_id 1 core_id 1:
> > +  isolated : false
> > +  port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 3 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 4 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 5 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 6 (enabled) pmd usage: NOT AVAIL
> > +  port: p0 queue-id: 7 (enabled) pmd usage: NOT AVAIL
> > +  overhead: NOT AVAIL
> > +pmd thread numa_id 0 core_id 2:
> > +  isolated : false
> > +])
> > +
> > +
> >  # Switch back from mixed numa to single numa
> >  TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> >  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1])
> > @@ -925,6 +1011,19 @@ p1 0 0 1
> >  p1 1 0 2
> >  ])
> >
> > +dnl For rxq affinity, disabling numa assignment should have no effect.
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=false])
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
> > +p1 0 0 1
> > +p1 1 0 2
> > +])
> > +
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=true])
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
> > +p1 0 0 1
> > +p1 1 0 2
> > +])
> > +
> >  AT_CHECK([ovs-vsctl set Interface p1 other_config:pmd-rxq-affinity="0:3,1:4"])
> >
> >  dnl We moved the queues to different contiguous numa node. Expecting threads on
> > @@ -934,6 +1033,13 @@ p1 0 1 3
> >  p1 1 1 4
> >  ])
> >
> > +dnl For rxq affinity, disabling numa assignment should have no effect.
> > +AT_CHECK([ovs-vsctl set open . other_config:pmd-rxq-numa=false])
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
> > +p1 0 1 3
> > +p1 1 1 4
> > +])
> > +
> >  AT_CHECK([ovs-vsctl set Interface p1 other_config:pmd-rxq-affinity="0:5,1:6"])
> >
> >  dnl We moved the queues to different non-contiguous numa node. Expecting threads on
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 064e0facf..1e7c00746 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -576,7 +576,24 @@
> >          is set to <code>group</code>.
> >        </p>
> >      </column>
> > -
> > +    <column name="other_config" key="pmd-rxq-numa"
> > +            type='{"type": "boolean"}'>
> > +      <p>
> > +        Configures whether Rx queues are assigned to cores with NUMA
> > +        locality taken into consideration.
> > +      </p>
> > +      <p>
> > +        The current cycle-based assignment scheme is still used for Rx
> > +        queues that are not statically pinned to PMDs.
> > +      </p>
> > +      <p>
> > +        The default value is <code>true</code>.
> > +      </p>
> > +      <p>
> > +        Set this value to <code>false</code> to assign Rx queues to PMDs
> > +        regardless of NUMA locality.
> > +      </p>
> > +    </column>
> >      <column name="other_config" key="n-handler-threads"
> >              type='{"type": "integer", "minInteger": 1}'>
> >        <p>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
