On 24/06/2021 15:52, David Marchand wrote:
> On Fri, Jun 4, 2021 at 11:19 PM Kevin Traynor <[email protected]> wrote:
>>
>> Add an rxq scheduling option that allows rxqs to be grouped
>> on a pmd based purely on their load.
>>
>> The current default 'cycles' assignment sorts rxqs by measured
>> processing load and then assigns them to a round-robin list of PMDs.
>> This helps to keep the rxqs that require the most processing on
>> different cores, but as it selects the PMDs in round-robin order, it
>> distributes rxqs equally across the PMDs.
>>
>> 'cycles' assignment has the advantage that it keeps the most heavily
>> loaded rxqs off the same core, while still spreading rxqs across a
>> broad range of PMDs to mitigate against changes in traffic pattern.
>>
>> 'cycles' assignment has the disadvantage that, in order to make the
>> trade-off between optimising for the current traffic load and
>> mitigating against future changes, it tries to assign an equal number
>> of rxqs per PMD in a round-robin manner, and this can lead to a less
>> than optimal balance of the processing load.
>>
>> Now that PMD auto load balance can help mitigate future changes in
>> traffic patterns, a 'group' assignment can be used to assign rxqs based
>> on their measured cycles and the estimated running total on each PMD.
>>
>> In this case, there is no restriction on keeping an equal number of
>> rxqs per PMD, as the assignment is purely load based.
>>
>> This means that one PMD may have a group of low load rxqs assigned to it
>> while another PMD has one high load rxq assigned to it, as that is the
>> best balance of their measured loads across the PMDs.
>>
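As an aside for readers of the thread: the 'group' selection boils down to
a greedy pick, assigning each rxq, in order of decreasing measured load, to
the PMD with the lowest running total. A rough sketch only, with
hypothetical helper names, not the patch code:

    /* Sketch: rxqs[] is already sorted by measured cycles, descending.
     * lowest_loaded_pmd() and assign_rxq_to_pmd() are made-up helpers. */
    for (int i = 0; i < n_rxqs; i++) {
        struct sched_pmd *pmd = lowest_loaded_pmd(numa);

        assign_rxq_to_pmd(rxqs[i], pmd);
        /* Keep a running estimate of the load on that PMD. */
        pmd->pmd_proc_cycles += rxqs[i]->cycles;
    }
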
>> Signed-off-by: Kevin Traynor <[email protected]>
>> ---
>> Documentation/topics/dpdk/pmd.rst | 26 ++++++
>> lib/dpif-netdev.c | 141 +++++++++++++++++++++++++-----
>> vswitchd/vswitch.xml | 5 +-
>> 3 files changed, 148 insertions(+), 24 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index e481e7941..d1c45cdfb 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -137,4 +137,30 @@ The Rx queues will be assigned to the cores in the following order::
>> Core 8: Q3 (60%) | Q0 (30%)
>>
>> +``group`` assignment is similar to ``cycles`` in that the Rxqs will be
>> +ordered by their measured processing cycles before being assigned to PMDs.
>> +It differs from ``cycles`` in that it uses a running estimate of the cycles
>> +that will be on each PMD to select the PMD with the lowest load for each Rxq.
>> +
>> +This means that there can be a group of low traffic Rxqs on one PMD, while a
>> +high traffic Rxq may have a PMD to itself. Where ``cycles`` kept the number
>> +of Rxqs per PMD as close to equal as possible, with ``group`` this
>> +restriction is removed for a better balance of the workload across PMDs.
>> +
>> +For example, where there are five Rx queues and three cores - 3, 7, and 8 -
>> +available and the measured usage of core cycles per Rx queue over the last
>> +interval is seen to be:
>> +
>> +- Queue #0: 10%
>> +- Queue #1: 80%
>> +- Queue #3: 50%
>> +- Queue #4: 70%
>> +- Queue #5: 10%
>> +
>> +The Rx queues will be assigned to the cores in the following order::
>> +
>> + Core 3: Q1 (80%) |
>> + Core 7: Q4 (70%) |
>> + Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%)
>> +
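Walking through the running totals in this example: Q1 (80%) and Q4 (70%)
each take an empty core and Q3 (50%) takes the remaining one. Q0 (10%) then
goes to core 8, the least loaded at 50%, bringing it to 60%, and Q5 (10%)
goes to core 8 again since 60% is still lower than 70% and 80%.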
>> Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
>> assigned to PMDs in a round-robined fashion. This algorithm was used by
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index eaa4e9733..61e0a516f 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -306,4 +306,11 @@ struct pmd_auto_lb {
>> };
>>
>> +enum sched_assignment_type {
>> + SCHED_ROUNDROBIN,
>> + SCHED_CYCLES, /* Default.*/
>> + SCHED_GROUP,
>> + SCHED_MAX
>> +};
>> +
>> /* Datapath based on the network device interface from netdev.h.
>> *
>> @@ -367,5 +374,5 @@ struct dp_netdev {
>> struct ovs_mutex tx_qid_pool_mutex;
>> /* Use measured cycles for rxq to pmd assignment. */
>> - bool pmd_rxq_assign_cyc;
>> + enum sched_assignment_type pmd_rxq_assign_cyc;
>>
>> /* Protects the access of the 'struct dp_netdev_pmd_thread'
>> @@ -1799,5 +1806,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>>
>> cmap_init(&dp->poll_threads);
>> - dp->pmd_rxq_assign_cyc = true;
>> + dp->pmd_rxq_assign_cyc = SCHED_CYCLES;
>>
>> ovs_mutex_init(&dp->tx_qid_pool_mutex);
>> @@ -4223,5 +4230,5 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>> bool enable_alb = false;
>> bool multi_rxq = false;
>> - bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
>> + enum sched_assignment_type pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
>>
>> /* Ensure that there is at least 2 non-isolated PMDs and
>> @@ -4242,6 +4249,6 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>> }
>>
>> - /* Enable auto LB if it is requested and cycle based assignment is true. */
>> - enable_alb = enable_alb && pmd_rxq_assign_cyc &&
>> + /* Enable auto LB if requested and not using roundrobin assignment. */
>> + enable_alb = enable_alb && pmd_rxq_assign_cyc != SCHED_ROUNDROBIN &&
>> pmd_alb->auto_lb_requested;
>>
>> @@ -4284,4 +4291,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>> uint8_t rebalance_improve;
>> bool log_autolb = false;
>> + enum sched_assignment_type pmd_rxq_assign_cyc;
>>
>> tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
>> @@ -4342,9 +4350,15 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>> }
>>
>> - bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
>> - if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) {
>> - VLOG_WARN("Unsupported Rxq to PMD assignment mode in
>> pmd-rxq-assign. "
>> - "Defaulting to 'cycles'.");
>> - pmd_rxq_assign_cyc = true;
>> + if (!strcmp(pmd_rxq_assign, "roundrobin")) {
>> + pmd_rxq_assign_cyc = SCHED_ROUNDROBIN;
>> + } else if (!strcmp(pmd_rxq_assign, "cycles")) {
>> + pmd_rxq_assign_cyc = SCHED_CYCLES;
>> + } else if (!strcmp(pmd_rxq_assign, "group")) {
>> + pmd_rxq_assign_cyc = SCHED_GROUP;
>> + } else {
>> + /* default */
>> + VLOG_WARN("Unsupported rx queue to PMD assignment mode in "
>> + "pmd-rxq-assign. Defaulting to 'cycles'.");
>> + pmd_rxq_assign_cyc = SCHED_CYCLES;
>> pmd_rxq_assign = "cycles";
>> }
>> @@ -5171,4 +5185,61 @@ compare_rxq_cycles(const void *a, const void *b)
>> }
>>
>> +static struct sched_pmd *
>> +get_lowest_num_rxq_pmd(struct sched_numa *numa)
>> +{
>> + struct sched_pmd *lowest_rxqs_sched_pmd = NULL;
>> + unsigned lowest_rxqs = UINT_MAX;
>> +
>> + /* find the pmd with lowest number of rxqs */
>> + for (unsigned i = 0; i < numa->n_pmds; i++) {
>> + struct sched_pmd *sched_pmd;
>> + unsigned num_rxqs;
>> +
>> + sched_pmd = &numa->pmds[i];
>> + num_rxqs = sched_pmd->n_rxq;
>> + if (sched_pmd->isolated) {
>> + continue;
>> + }
>> +
>> + /* If this current load is higher we can go to the next one */
>> + if (num_rxqs > lowest_rxqs) {
>
> Testing for >= avoids the (wrongly indented) test below.
>
Actually there is no real need for this 'if() {continue;}'. I had some
other code below, but now that it is just a simple if() below, I can
remove it and let the loop fall through.
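i.e. something along these lines (just a sketch of what the next revision
might look like, not final code):

    for (unsigned i = 0; i < numa->n_pmds; i++) {
        struct sched_pmd *sched_pmd = &numa->pmds[i];

        if (sched_pmd->isolated) {
            continue;
        }
        /* No need to skip higher counts explicitly; they simply fail
         * the '<' test and fall through. */
        if (sched_pmd->n_rxq < lowest_rxqs) {
            lowest_rxqs = sched_pmd->n_rxq;
            lowest_rxqs_sched_pmd = sched_pmd;
        }
    }
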
>
>> + continue;
>> + }
>> + if (num_rxqs < lowest_rxqs) {
>> + lowest_rxqs = num_rxqs;
>> + lowest_rxqs_sched_pmd = sched_pmd;
>> + }
fixed indent
>> + }
>> + return lowest_rxqs_sched_pmd;
>> +}
>> +
>> +static struct sched_pmd *
>> +get_lowest_proc_pmd(struct sched_numa *numa)
>> +{
>> + struct sched_pmd *lowest_loaded_sched_pmd = NULL;
>> + uint64_t lowest_load = UINT64_MAX;
>> +
>> + /* find the pmd with the lowest load */
>> + for (unsigned i = 0; i < numa->n_pmds; i++) {
>> + struct sched_pmd *sched_pmd;
>> + uint64_t pmd_load;
>> +
>> + sched_pmd = &numa->pmds[i];
>> + if (sched_pmd->isolated) {
>> + continue;
>> + }
>> + pmd_load = sched_pmd->pmd_proc_cycles;
>> + /* If this current load is higher we can go to the next one */
>> + if (pmd_load > lowest_load) {
>> + continue;
>> + }
>> + if (pmd_load < lowest_load) {
>> + lowest_load = pmd_load;
>> + lowest_loaded_sched_pmd = sched_pmd;
>> + }
>> + }
>> + return lowest_loaded_sched_pmd;
>> +}
>> +
>> /*
>> * Returns the next pmd from the numa node.
>> @@ -5229,16 +5300,40 @@ get_available_rr_pmd(struct sched_numa *numa, bool updown)
>>
>> static struct sched_pmd *
>> -get_next_pmd(struct sched_numa *numa, bool algo)
>> +get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
>> + bool has_proc)
>> {
>> - return get_available_rr_pmd(numa, algo);
>> + if (algo == SCHED_GROUP) {
>> + struct sched_pmd *sched_pmd = NULL;
>> +
>> + /* Check if the rxq has associated cycles. This is handled differently
>> + * as adding a zero cycles rxq to a PMD will mean that the lowest
>> + * core would not change on a subsequent call and all zero rxqs would
>> + * be assigned to the same PMD. */
>> + if (has_proc) {
>> + sched_pmd = get_lowest_proc_pmd(numa);
>> + } else {
>> + sched_pmd = get_lowest_num_rxq_pmd(numa);
>> + }
>> + /* If there is a pmd selected, return it now. */
>> + if (sched_pmd) {
>> + return sched_pmd;
>> + }
>
> The only case where sched_pmd == NULL is when n_pmds == 0 in which
> case the rr stuff would also lead to no pmd available.
> And it seems unintuitive that in non rr mode we would still call the
> rr pmd selector.
>
> I would simply return sched_pmd here.
>
Yes, it's true that the result should be the same, i.e. no available
sched_pmd. In fact, with that change I can also remove the sched_pmd var
and just return directly from the get_lowest_*() functions (actually,
*function*, as I have now combined them into one).
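Roughly like this (sketch only; the name and exact shape may differ in the
next revision):

    static struct sched_pmd *
    get_lowest_pmd(struct sched_numa *numa, bool has_proc)
    {
        struct sched_pmd *lowest_sched_pmd = NULL;
        uint64_t lowest_load = UINT64_MAX;

        for (unsigned i = 0; i < numa->n_pmds; i++) {
            struct sched_pmd *sched_pmd = &numa->pmds[i];
            uint64_t load;

            if (sched_pmd->isolated) {
                continue;
            }
            /* Compare on processing cycles when the rxq has them,
             * otherwise fall back to the number of assigned rxqs. */
            load = has_proc ? sched_pmd->pmd_proc_cycles : sched_pmd->n_rxq;
            if (load < lowest_load) {
                lowest_load = load;
                lowest_sched_pmd = sched_pmd;
            }
        }
        return lowest_sched_pmd;
    }
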
>
>
>> + }
>> +
>> + /* By default or as a last resort, just RR the PMDs. */
>> + return get_available_rr_pmd(numa, algo == SCHED_CYCLES ? true : false);
>
> This ternary is unneeded.
>
I'm not sure I should rely on enum values converting cleanly to a bool. It
feels a bit fragile in case the enum ever changes.
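For reference, the non-ternary version suggested would just be:

    return get_available_rr_pmd(numa, algo == SCHED_CYCLES);

since the comparison itself already evaluates to 0 or 1 regardless of the
enum's values; keeping the explicit ternary is a readability choice.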
>
>> }
>>
>
> [snip]
>
>