On 24/06/2021 15:52, David Marchand wrote:
> On Fri, Jun 4, 2021 at 11:19 PM Kevin Traynor <[email protected]> wrote:
>>
>> Add an rxq scheduling option that allows rxqs to be grouped
>> on a pmd based purely on their load.
>>
>> The current default 'cycles' assignment sorts rxqs by measured
>> processing load and then assigns them to a round-robin list of PMDs.
>> This helps to keep the rxqs that require the most processing on
>> different cores, but as it selects the PMDs in round-robin order, it
>> distributes rxqs equally across the PMDs.
>>
>> 'cycles' assignment has the advantage that it keeps the most heavily
>> loaded rxqs off the same core, while still spreading rxqs across a
>> broad range of PMDs to mitigate against changes in traffic pattern.
>>
>> 'cycles' assignment has the disadvantage that, in order to make the
>> trade-off between optimising for the current traffic load and
>> mitigating against future changes, it tries to assign an equal number
>> of rxqs per PMD in a round-robin manner, and this can lead to a less
>> than optimal balance of the processing load.
>>
>> Now that PMD auto load balance can help mitigate future changes in
>> traffic patterns, a 'group' assignment can be used to assign rxqs based
>> on their measured cycles and the estimated running total on each PMD.
>>
>> In this case, there is no restriction on keeping an equal number of
>> rxqs per PMD, as the assignment is purely load based.
>>
>> This means that one PMD may have a group of low load rxqs assigned to it
>> while another PMD has one high load rxq assigned to it, as that is the
>> best balance of their measured loads across the PMDs.
>>
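As an aside for readers of the thread: the 'group' selection boils down to
a greedy pick, assigning each rxq, in order of decreasing measured load, to
the PMD with the lowest running total. A rough sketch only, with
hypothetical helper names, not the patch code:

    /* Sketch: rxqs[] is already sorted by measured cycles, descending.
     * lowest_loaded_pmd() and assign_rxq_to_pmd() are made-up helpers. */
    for (int i = 0; i < n_rxqs; i++) {
        struct sched_pmd *pmd = lowest_loaded_pmd(numa);

        assign_rxq_to_pmd(rxqs[i], pmd);
        /* Keep a running estimate of the load on that PMD. */
        pmd->pmd_proc_cycles += rxqs[i]->cycles;
    }
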
>> Signed-off-by: Kevin Traynor <[email protected]>
>> ---
>> Documentation/topics/dpdk/pmd.rst | 26 ++++++
>> lib/dpif-netdev.c | 141 +++++++++++++++++++++++++-----
>> vswitchd/vswitch.xml | 5 +-
>> 3 files changed, 148 insertions(+), 24 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index e481e7941..d1c45cdfb 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -137,4 +137,30 @@ The Rx queues will be assigned to the cores in the following order::
>> Core 8: Q3 (60%) | Q0 (30%)
>>
>> +``group`` assignment is similar to ``cycles`` in that the Rxqs will be
>> +ordered by their measured processing cycles before being assigned to PMDs.
>> +It differs from ``cycles`` in that it uses a running estimate of the cycles
>> +that will be on each PMD to select the PMD with the lowest load for each Rxq.
>> +
>> +This means that there can be a group of low traffic Rxqs on one PMD, while a
>> +high traffic Rxq may have a PMD to itself. Where ``cycles`` kept the number
>> +of Rxqs per PMD as close to equal as possible, with ``group`` this
>> +restriction is removed for a better balance of the workload across PMDs.
>> +
>> +For example, where there are five Rx queues and three cores - 3, 7, and 8 -
>> +available and the measured usage of core cycles per Rx queue over the last
>> +interval is seen to be:
>> +
>> +- Queue #0: 10%
>> +- Queue #1: 80%
>> +- Queue #3: 50%
>> +- Queue #4: 70%
>> +- Queue #5: 10%
>> +
>> +The Rx queues will be assigned to the cores in the following order::
>> +
>> + Core 3: Q1 (80%) |
>> + Core 7: Q4 (70%) |
>> + Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%)
>> +
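Walking through the running totals in this example: Q1 (80%) and Q4 (70%)
each take an empty core and Q3 (50%) takes the remaining one. Q0 (10%) then
goes to core 8, the least loaded at 50%, bringing it to 60%, and Q5 (10%)
goes to core 8 again since 60% is still lower than 70% and 80%.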
>> Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
>> assigned to PMDs in a round-robined fashion. This algorithm was used by
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index eaa4e9733..61e0a516f 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -306,4 +306,11 @@ struct pmd_auto_lb {
>> };
>>
>> +enum sched_assignment_type {
>> + SCHED_ROUNDROBIN,
>> + SCHED_CYCLES, /* Default.*/
>> + SCHED_GROUP,
>> + SCHED_MAX
>> +};
>> +
>> /* Datapath based on the network device interface from netdev.h.
>> *
>> @@ -367,5 +374,5 @@ struct dp_netdev {
>> struct ovs_mutex tx_qid_pool_mutex;
>> /* Use measured cycles for rxq to pmd assignment. */
>> - bool pmd_rxq_assign_cyc;
>> + enum sched_assignment_type pmd_rxq_assign_cyc;
>>
>> /* Protects the access of the 'struct dp_netdev_pmd_thread'
>> @@ -1799,5 +1806,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>>
>> cmap_init(&dp->poll_threads);
>> - dp->pmd_rxq_assign_cyc = true;
>> + dp->pmd_rxq_assign_cyc = SCHED_CYCLES;
>>
>> ovs_mutex_init(&dp->tx_qid_pool_mutex);
>> @@ -4223,5 +4230,5 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>> bool enable_alb = false;
>> bool multi_rxq = false;
>> - bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
>> + enum sched_assignment_type pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
>>
>> /* Ensure that there is at least 2 non-isolated PMDs and
>> @@ -4242,6 +4249,6 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>> }
>>
>> - /* Enable auto LB if it is requested and cycle based assignment is true. */
>> - enable_alb = enable_alb && pmd_rxq_assign_cyc &&
>> + /* Enable auto LB if requested and not using roundrobin assignment. */
>> + enable_alb = enable_alb && pmd_rxq_assign_cyc != SCHED_ROUNDROBIN &&
>> pmd_alb->auto_lb_requested;
>>
>> @@ -4284,4 +4291,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>> uint8_t rebalance_improve;
>> bool log_autolb = false;
>> + enum sched_assignment_type pmd_rxq_assign_cyc;
>>
>> tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
>> @@ -4342,9 +4350,15 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>> }
>>
>> - bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
>> - if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) {
>> - VLOG_WARN("Unsupported Rxq to PMD assignment mode in
>> pmd-rxq-assign. "
>> - "Defaulting to 'cycles'.");
>> - pmd_rxq_assign_cyc = true;
>> + if (!strcmp(pmd_rxq_assign, "roundrobin")) {
>> + pmd_rxq_assign_cyc = SCHED_ROUNDROBIN;
>> + } else if (!strcmp(pmd_rxq_assign, "cycles")) {
>> + pmd_rxq_assign_cyc = SCHED_CYCLES;
>> + } else if (!strcmp(pmd_rxq_assign, "group")) {
>> + pmd_rxq_assign_cyc = SCHED_GROUP;
>> + } else {
>> + /* default */
>> + VLOG_WARN("Unsupported rx queue to PMD assignment mode in "
>> + "pmd-rxq-assign. Defaulting to 'cycles'.");
>> + pmd_rxq_assign_cyc = SCHED_CYCLES;
>> pmd_rxq_assign = "cycles";
>> }
>> @@ -5171,4 +5185,61 @@ compare_rxq_cycles(const void *a, const void *b)
>> }
>>
>> +static struct sched_pmd *
>> +get_lowest_num_rxq_pmd(struct sched_numa *numa)
>> +{
>> + struct sched_pmd *lowest_rxqs_sched_pmd = NULL;
>> + unsigned lowest_rxqs = UINT_MAX;
>> +
>> + /* find the pmd with lowest number of rxqs */
>> + for (unsigned i = 0; i < numa->n_pmds; i++) {
>> + struct sched_pmd *sched_pmd;
>> + unsigned num_rxqs;
>> +
>> + sched_pmd = &numa->pmds[i];
>> + num_rxqs = sched_pmd->n_rxq;
>> + if (sched_pmd->isolated) {
>> + continue;
>> + }
>> +
>> + /* If this current load is higher we can go to the next one */
>> + if (num_rxqs > lowest_rxqs) {
>
> Testing for >= avoids the (wrongly indented) test below.
>
Actually there is no real need for this 'if() {continue;}'. I had some
other code below, but now that it is just a simple if() below, I can
remove it and let the loop fall through.
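i.e. something along these lines (just a sketch of what the next revision
might look like, not final code):

    for (unsigned i = 0; i < numa->n_pmds; i++) {
        struct sched_pmd *sched_pmd = &numa->pmds[i];

        if (sched_pmd->isolated) {
            continue;
        }
        /* No need to skip higher counts explicitly; they simply fail
         * the '<' test and fall through. */
        if (sched_pmd->n_rxq < lowest_rxqs) {
            lowest_rxqs = sched_pmd->n_rxq;
            lowest_rxqs_sched_pmd = sched_pmd;
        }
    }
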
>
>> + continue;
>> + }
>> + if (num_rxqs < lowest_rxqs) {
>> + lowest_rxqs = num_rxqs;
>> + lowest_rxqs_sched_pmd = sched_pmd;
>> + }
fixed indent
>> + }
>> + return lowest_rxqs_sched_pmd;
>> +}
>> +
>> +static struct sched_pmd *
>> +get_lowest_proc_pmd(struct sched_numa *numa)
>> +{
>> + struct sched_pmd *lowest_loaded_sched_pmd = NULL;
>> + uint64_t lowest_load = UINT64_MAX;
>> +
>> + /* find the pmd with the lowest load */
>> + for (unsigned i = 0; i < numa->n_pmds; i++) {
>> + struct sched_pmd *sched_pmd;
>> + uint64_t pmd_load;
>> +
>> + sched_pmd = &numa->pmds[i];
>> + if (sched_pmd->isolated) {
>> + continue;
>> + }
>> + pmd_load = sched_pmd->pmd_proc_cycles;
>> + /* If this current load is higher we can go to the next one */
>> + if (pmd_load > lowest_load) {
>> + continue;
>> + }
>> + if (pmd_load < lowest_load) {
>> + lowest_load = pmd_load;
>> + lowest_loaded_sched_pmd = sched_pmd;
>> + }
>> + }
>> + return lowest_loaded_sched_pmd;
>> +}
>> +
>> /*
>> * Returns the next pmd from the numa node.
>> @@ -5229,16 +5300,40 @@ get_available_rr_pmd(struct sched_numa *numa, bool updown)
>>
>> static struct sched_pmd *
>> -get_next_pmd(struct sched_numa *numa, bool algo)
>> +get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
>> + bool has_proc)
>> {
>> - return get_available_rr_pmd(numa, algo);
>> + if (algo == SCHED_GROUP) {
>> + struct sched_pmd *sched_pmd = NULL;
>> +
>> + /* Check if the rxq has associated cycles. This is handled differently
>> + * as adding a zero cycles rxq to a PMD will mean that the lowest
>> + * core would not change on a subsequent call and all zero rxqs would
>> + * be assigned to the same PMD. */
>> + if (has_proc) {
>> + sched_pmd = get_lowest_proc_pmd(numa);
>> + } else {
>> + sched_pmd = get_lowest_num_rxq_pmd(numa);
>> + }
>> + /* If there is a pmd selected, return it now. */
>> + if (sched_pmd) {
>> + return sched_pmd;
>> + }
>
> The only case where sched_pmd == NULL is when n_pmds == 0 in which
> case the rr stuff would also lead to no pmd available.
> And it seems unintuitive that in non rr mode we would still call the
> rr pmd selector.
>
> I would simply return sched_pmd here.
>
Yes, it's true that the result should be the same, i.e. no available
sched_pmd. In fact, with that change I can also remove the sched_pmd var
and just return directly from the get_lowest_*() functions (actually,
*function*, as I have now combined them into one).
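Roughly like this (sketch only; the name and exact shape may differ in the
next revision):

    static struct sched_pmd *
    get_lowest_pmd(struct sched_numa *numa, bool has_proc)
    {
        struct sched_pmd *lowest_sched_pmd = NULL;
        uint64_t lowest_load = UINT64_MAX;

        for (unsigned i = 0; i < numa->n_pmds; i++) {
            struct sched_pmd *sched_pmd = &numa->pmds[i];
            uint64_t load;

            if (sched_pmd->isolated) {
                continue;
            }
            /* Compare on processing cycles when the rxq has them,
             * otherwise fall back to the number of assigned rxqs. */
            load = has_proc ? sched_pmd->pmd_proc_cycles : sched_pmd->n_rxq;
            if (load < lowest_load) {
                lowest_load = load;
                lowest_sched_pmd = sched_pmd;
            }
        }
        return lowest_sched_pmd;
    }
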
>
>
>> + }
>> +
>> + /* By default or as a last resort, just RR the PMDs. */
>> + return get_available_rr_pmd(numa, algo == SCHED_CYCLES ? true : false);
>
> This ternary is unneeded.
>
I'm not sure I should rely on enum values converting cleanly to a bool. It
feels a bit fragile in case the enum ever changes.
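For reference, the non-ternary version suggested would just be:

    return get_available_rr_pmd(numa, algo == SCHED_CYCLES);

since the comparison itself already evaluates to 0 or 1 regardless of the
enum's values; keeping the explicit ternary is a readability choice.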
>
>> }
>>
>
> [snip]
>
>