Hi, Kevin. Thanks for the patch. I didn't look at the code closely, but I have a few high-level comments.
First, please change the patch prefix to 'dpif-netdev': this patch affects only that module and makes no changes to 'netdev-dpdk'.

Next, I have a concern about the "portqueue" option naming and the docs. The documentation says that "Rxqs are ordered by their port/queue number, and then assigned in ascending order to PMDs". I think that's a bit misleading, because ports are scheduled in the order in which they appear in a hash map, and that order can differ across compiler flags and CPU architectures because of different hash functions. So ports will be assigned in a round-robin fashion, but we can't actually predict which port will be assigned to which CPU core (see the toy illustration at the end of this mail). Maybe it would be better to rename the option to something like "round-robin" and rewrite the docs to avoid giving a false impression of a predictable port-to-CPU mapping. What do you think?

The vswitch.xml changes are misplaced: they should go in the "Configuration" group of the "Open_vSwitch" table, not the "Interface" table.

The last thing is that you need to request a reconfiguration when this option changes. IMHO, users will expect ports to be re-assigned when the algorithm in the database changes. This should not be a frequent event, so you can also avoid using atomics here: just add a plain new field in the structure (a rough sketch follows at the very end of this mail).

Best regards, Ilya Maximets.

On 21.08.2018 20:15, Kevin Traynor wrote:
> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
> (i.e. CPUs) was done by assigning Rxqs in an ascending
> port/queue order, round robined across the available
> PMDs.
> 
> That was changed in OVS 2.9 to order the Rxqs by the
> measured processing cycles for each, in order to try
> and keep the busiest Rxqs on different PMDs.
> 
> For the most part the new scheme should be better, but
> there could be situations where a user prefers a
> port/queue scheme because Rxqs from a single port are
> more likely to be spread across multiple cores, and/or
> traffic is very bursty/unpredictable.
> 
> Add the option to have a port/queue based assignment.
> 
> Signed-off-by: Kevin Traynor <[email protected]>
> ---
>  Documentation/topics/dpdk/pmd.rst |  34 +++++++++--
>  NEWS                              |   2 +
>  lib/dpif-netdev.c                 | 115 +++++++++++++++++++++++++++-----------
>  tests/pmd.at                      |  15 ++++-
>  vswitchd/vswitch.xml              |  23 ++++++++
>  5 files changed, 147 insertions(+), 42 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 5f0671e..e8628cc 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -113,10 +113,15 @@ means that this thread will only poll the *pinned* Rx queues.
>  
>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
> -(cores) automatically. Where known, the processing cycles that have been stored
> -for each Rx queue will be used to assign Rx queue to PMDs based on a round
> -robin of the sorted Rx queues. For example, take the following example, where
> -there are five Rx queues and three cores - 3, 7, and 8 - available and the
> -measured usage of core cycles per Rx queue over the last interval is seen to
> -be:
> +(cores) automatically.
> +
> +The algorithm used to automatically assign Rxqs to PMDs can be set by::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
> +
> +By default, ``cycles`` assignment is used where the Rxqs will be ordered by
> +their measured processing cycles, and then be assigned in descending order to
> +PMDs based on a round robin up/down walk of the PMDs. For example, where there
> +are five Rx queues and three cores - 3, 7, and 8 - available and the measured
> +usage of core cycles per Rx queue over the last interval is seen to be:
>  
>  - Queue #0: 30%
> @@ -132,4 +137,21 @@ The Rx queues will be assigned to the cores in the following order::
>  
>      Core 8: Q3 (60%) | Q0 (30%)
>  
> +Alternatively, ``portqueue`` assignment can be used, where the Rxqs are
> +ordered by their port/queue number, and then assigned in ascending order to
> +PMDs based on a round robin of the PMDs. This algorithm was used by default
> +prior to OVS 2.9. For example, given the following ports and queues:
> +
> +- Port #0 Queue #0 (P0Q0)
> +- Port #0 Queue #1 (P0Q1)
> +- Port #1 Queue #0 (P1Q0)
> +- Port #1 Queue #1 (P1Q1)
> +- Port #1 Queue #2 (P1Q2)
> +
> +The Rx queues will be assigned to the cores in the following order::
> +
> +    Core 3: P0Q0 | P1Q1
> +    Core 7: P0Q1 | P1Q2
> +    Core 8: P1Q0 |
> +
>  To see the current measured usage history of PMD core cycles for each Rx queue::
> diff --git a/NEWS b/NEWS
> index 8987f9a..9e809d1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -5,4 +5,6 @@ Post-v2.10.0
>     - The environment variable OVS_CTL_TIMEOUT, if set, is now used
>       as the default timeout for control utilities.
> +   - DPDK:
> +     * Add option for port/queue based Rxq to PMD assignment.
>  
>  
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7f836bb..507906c 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -342,4 +342,7 @@ struct dp_netdev {
>      struct id_pool *tx_qid_pool;
>      struct ovs_mutex tx_qid_pool_mutex;
> +    /* Stores whether an rxq's used cycles should be
> +     * used to influence assignment to pmds. */
> +    atomic_bool pmd_rxq_assign_cyc;
>  
>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
> @@ -1491,4 +1494,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>      atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
>      atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
> +    atomic_init(&dp->pmd_rxq_assign_cyc, true);
>  
>      cmap_init(&dp->poll_threads);
> @@ -3717,4 +3721,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      struct dp_netdev *dp = get_dp_netdev(dpif);
>      const char *cmask = smap_get(other_config, "pmd-cpu-mask");
> +    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
> +                                              "cycles");
>      unsigned long long insert_prob =
>          smap_get_ullong(other_config, "emc-insert-inv-prob",
> @@ -3779,4 +3785,20 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
> +
> +    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
> +    if (pmd_rxq_assign_cyc || !strcmp(pmd_rxq_assign, "portqueue")) {
> +        bool cur_pmd_rxq_assign_cyc;
> +        atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &cur_pmd_rxq_assign_cyc);
> +        if (pmd_rxq_assign_cyc != cur_pmd_rxq_assign_cyc) {
> +            atomic_store_relaxed(&dp->pmd_rxq_assign_cyc, pmd_rxq_assign_cyc);
> +            if (pmd_rxq_assign_cyc) {
> +                VLOG_INFO("Rxq to PMD assignment mode: "
> +                          "Sort queues by processing cycles.");
> +            } else {
> +                VLOG_INFO("Rxq to PMD assignment mode: "
> +                          "Sort queues by port/queue number.");
> +            }
> +        }
> +    }
>      return 0;
>  }
> @@ -4249,27 +4271,40 @@ rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>  }
>  
> -/* Returns the next pmd from the numa node in
> - * incrementing or decrementing order. */
> +/* Returns the next pmd from the numa node.
> + *
> + * If updown is 'true' it will alternate between
> + * selecting the next pmd in either an up or down
> + * walk, switching between up/down each time the
> + * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
> + *
> + * If updown is 'false' it will select the next pmd
> + * in ascending order, wrapping around when max core
> + * is reached. e.g. 1,2,3,1,2,3,1,2... */
>  static struct dp_netdev_pmd_thread *
> -rr_numa_get_pmd(struct rr_numa *numa)
> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>  {
> -    int numa_idx = numa->cur_index;
> +    int numa_idx;
>  
> -    if (numa->idx_inc == true) {
> -        /* Incrementing through list of pmds. */
> -        if (numa->cur_index == numa->n_pmds-1) {
> -            /* Reached the last pmd. */
> -            numa->idx_inc = false;
> +    if (updown) {
> +        numa_idx = numa->cur_index;
> +        if (numa->idx_inc == true) {
> +            /* Incrementing through list of pmds. */
> +            if (numa->cur_index == numa->n_pmds-1) {
> +                /* Reached the last pmd. */
> +                numa->idx_inc = false;
> +            } else {
> +                numa->cur_index++;
> +            }
>          } else {
> -            numa->cur_index++;
> +            /* Decrementing through list of pmds. */
> +            if (numa->cur_index == 0) {
> +                /* Reached the first pmd. */
> +                numa->idx_inc = true;
> +            } else {
> +                numa->cur_index--;
> +            }
>          }
>      } else {
> -        /* Decrementing through list of pmds. */
> -        if (numa->cur_index == 0) {
> -            /* Reached the first pmd. */
> -            numa->idx_inc = true;
> -        } else {
> -            numa->cur_index--;
> -        }
> +        numa_idx = numa->cur_index++ % numa->n_pmds;
>      }
>      return numa->pmds[numa_idx];
> @@ -4323,7 +4358,4 @@ compare_rxq_cycles(const void *a, const void *b)
>   * pmds to unpinned queues.
>   *
> - * If 'pinned' is false queues will be sorted by processing cycles they are
> - * consuming and then assigned to pmds in round robin order.
> - *
>   * The function doesn't touch the pmd threads, it just stores the assignment
>   * in the 'pmd' member of each rxq. */
> @@ -4338,5 +4370,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      struct rr_numa *numa = NULL;
>      int numa_id;
> +    bool assign_cyc;
>  
> +    atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &assign_cyc);
>      HMAP_FOR_EACH (port, node, &dp->ports) {
>          if (!netdev_is_pmd(port->netdev)) {
> @@ -4368,10 +4402,13 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                  rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>              }
> -            /* Sum the queue intervals and store the cycle history. */
> -            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> -                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> -            }
> -            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
>  
> +            if (assign_cyc) {
> +                /* Sum the queue intervals and store the cycle history. */
> +                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> +                }
> +                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> +                                         cycle_hist);
> +            }
>              /* Store the queue. */
>              rxqs[n_rxqs++] = q;
> @@ -4380,5 +4417,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      }
>  
> -    if (n_rxqs > 1) {
> +    if (n_rxqs > 1 && assign_cyc) {
>          /* Sort the queues in order of the processing cycles
>           * they consumed during their last pmd interval. */
> @@ -4404,5 +4441,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                  continue;
>              }
> -            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
> +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
>              VLOG_WARN("There's no available (non-isolated) pmd thread "
>                        "on numa node %d. Queue %d on port \'%s\' will "
> @@ -4413,11 +4450,21 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                        rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>          } else {
> -            rxqs[i]->pmd = rr_numa_get_pmd(numa);
> -            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> -                      "rx queue %d (measured processing cycles %"PRIu64").",
> -                      rxqs[i]->pmd->core_id, numa_id,
> -                      netdev_rxq_get_name(rxqs[i]->rx),
> -                      netdev_rxq_get_queue_id(rxqs[i]->rx),
> -                      dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
> +            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
> +            if (assign_cyc) {
> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> +                          "rx queue %d "
> +                          "(measured processing cycles %"PRIu64").",
> +                          rxqs[i]->pmd->core_id, numa_id,
> +                          netdev_rxq_get_name(rxqs[i]->rx),
> +                          netdev_rxq_get_queue_id(rxqs[i]->rx),
> +                          dp_netdev_rxq_get_cycles(rxqs[i],
> +                                                   RXQ_CYCLES_PROC_HIST));
> +            } else {
> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> +                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
> +                          netdev_rxq_get_name(rxqs[i]->rx),
> +                          netdev_rxq_get_queue_id(rxqs[i]->rx));
> +            }
> +
>          }
>      }
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 4cae6c8..7cbaf41 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -62,5 +62,6 @@ m4_define([CHECK_PMD_THREADS_CREATED], [
>  
>  m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
> -m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
> +m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
> +m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
>  m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
>  
> @@ -146,9 +147,19 @@ pmd thread numa_id <cleared> core_id <cleared>:
>  ])
>  
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
>  TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
>  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
>  CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
>  
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
> +port: p0 queue-id: <group>
> +port: p0 queue-id: <group>
> +])
> +
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-rebalance], [0], [dnl
> +pmd rxq rebalance requested.
> +])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
>  port: p0 queue-id: <group>
>  port: p0 queue-id: <group>
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0cd8520..b5789a5 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2810,4 +2810,27 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>        </column>
>  
> +      <column name="other_config" key="pmd-rxq-assign"
> +              type='{"type": "string"}'>
> +        <p>
> +          Specifies how RX queues are automatically assigned to CPU cores.
> +          Options:
> +          <code>cycles</code> - Sort queues in descending order of
> +          measured processing cycles before assigning round robin
> +          to CPUs.
> +          <code>portqueue</code> - Sort queues in ascending order
> +          of port/queue number before assigning round robin to CPUs.
> +        </p>
> +        <p>
> +          The default value is <code>cycles</code>.
> +        </p>
> +        <p>
> +          Note that this will not automatically perform a reschedule.
> +          When the mode is set to <code>portqueue</code> the
> +          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
> +          not change the Rxq to PMD assignment, as the port/queue/PMDs config
> +          will be unchanged from the last reconfiguration.
> +        </p>
> +      </column>
> +
>        <column name="options" key="vhost-server-path"
>                type='{"type": "string"}'>
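P.S. To make the hash-map ordering point above concrete, here is a toy standalone C program (not OVS code; the two hash functions are made-up stand-ins for whatever hash a particular build effectively uses). Iterating a bucketed hash map visits entries in bucket order, so the same five ports come out in a different order under each hash:

    /* Toy model of hash-map iteration order; not OVS code. */
    #include <stdio.h>

    #define N_BUCKETS 8

    /* Hypothetical hash functions standing in for the per-build hash. */
    static unsigned hash_a(unsigned key) { return (key * 5 + 3) % N_BUCKETS; }
    static unsigned hash_b(unsigned key) { return (key * 3 + 1) % N_BUCKETS; }

    static void show_order(const char *name, unsigned (*hash)(unsigned))
    {
        int buckets[N_BUCKETS];
        unsigned port;
        int i;

        for (i = 0; i < N_BUCKETS; i++) {
            buckets[i] = -1;
        }
        /* "Add" ports 0..4 to the map. */
        for (port = 0; port < 5; port++) {
            buckets[hash(port)] = port;  /* No collisions with these inputs. */
        }
        /* "Iterate" the map: walk the buckets in index order. */
        printf("%s:", name);
        for (i = 0; i < N_BUCKETS; i++) {
            if (buckets[i] >= 0) {
                printf(" port%d", buckets[i]);
            }
        }
        printf("\n");
    }

    int main(void)
    {
        show_order("hash_a", hash_a);  /* port1 port3 port0 port2 port4 */
        show_order("hash_b", hash_b);  /* port0 port3 port1 port4 port2 */
        return 0;
    }

Round-robining either order across cores 3, 7 and 8 gives a different port-to-core mapping, which is why I wouldn't document the mapping as predictable.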

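P.P.S. And a rough sketch of the reconfiguration suggestion (untested, just to show the shape; dp_netdev_request_reconfigure() is the existing helper that pmd-cpu-mask changes already go through):

    /* In struct dp_netdev: a plain field instead of the atomic.  It is only
     * written from dpif_netdev_set_config() and read during rxq scheduling,
     * which runs as part of reconfiguration. */
    bool pmd_rxq_assign_cyc;

    /* In dpif_netdev_set_config(): */
    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
    if (dp->pmd_rxq_assign_cyc != pmd_rxq_assign_cyc) {
        dp->pmd_rxq_assign_cyc = pmd_rxq_assign_cyc;
        VLOG_INFO("Rxq to PMD assignment mode changed to: '%s'.",
                  pmd_rxq_assign_cyc ? "cycles" : "round-robin");
        /* Rxqs will be re-assigned to pmds on the next reconfiguration. */
        dp_netdev_request_reconfigure(dp);
    }

With something like this in place, the "will not automatically perform a reschedule" note in vswitch.xml could simply go away.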