Re: [ovs-dev] [PATCH v2] Adding support for PMD auto load balancing

Nitin Katiyar Sun, 06 Jan 2019 20:47:54 -0800


> -----Original Message-----
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Friday, January 04, 2019 10:26 PM
> To: Nitin Katiyar <nitin.kati...@ericsson.com>; ovs-dev@openvswitch.org
> Cc: Rohith Basavaraja <rohith.basavar...@gmail.com>; Venkatesan Pradeep
> <venkatesan.prad...@ericsson.com>
> Subject: Re: [PATCH v2] Adding support for PMD auto load balancing
> 
> On 01/04/2019 02:56 PM, Kevin Traynor wrote:
> > On 01/04/2019 12:39 PM, Nitin Katiyar wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: Kevin Traynor [mailto:ktray...@redhat.com]
> >>> Sent: Friday, January 04, 2019 1:48 AM
> >>> To: Nitin Katiyar <nitin.kati...@ericsson.com>;
> >>> ovs-dev@openvswitch.org
> >>> Cc: Rohith Basavaraja <rohith.basavar...@gmail.com>; Venkatesan
> >>> Pradeep <venkatesan.prad...@ericsson.com>
> >>> Subject: Re: [PATCH v2] Adding support for PMD auto load balancing
> >>>
> >>> On 01/03/2019 12:36 PM, Nitin Katiyar wrote:
> >>>> Port rx queues that have not been statically assigned to PMDs are
> >>>> currently assigned based on periodically sampled load measurements.
> >>>> The assignment is performed at specific instances – port addition,
> >>>> port deletion, upon reassignment request via CLI etc.
> >>>>
> >>>> Due to change in traffic pattern over time it can cause uneven load
> >>>> among the PMDs and thus resulting in lower overall throughout.
> >>>>
> >>>> This patch enables the support of auto load balancing of PMDs based
> >>>> on measured load of RX queues. Each PMD measures the processing
> >>>> load for each of its associated queues every 10 seconds. If the
> >>>> aggregated PMD load reaches 95% for 6 consecutive intervals then
> >>>> PMD considers itself to
> >>> be overloaded.
> >>>>
> >>>> If any PMD is overloaded, a dry-run of the PMD assignment algorithm
> >>>> is performed by OVS main thread. The dry-run does NOT change the
> >>>> existing queue to PMD assignments.
> >>>>
> >>>> If the resultant mapping of dry-run indicates an improved
> >>>> distribution of the load then the actual reassignment will be performed.
> >>>>
> >>>> The automatic rebalancing will be disabled by default and has to be
> >>>> enabled via configuration option. The interval (in minutes) between
> >>>> two consecutive rebalancing can also be configured via CLI, default
> >>>> is
> >>>> 1 min.
> >>>>
> >>>> Following example commands can be used to set the auto-lb params:
> >>>> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> >>>> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-
> intvl="5"
> >>>>
> >>>
> >>> Hi Nitin, thanks for v2. Not full review yet but sending some comments
> below.
> >>>
> 
> Additional minor comment below, thanks.
Thanks again Kevin.
> 
> >>> Maybe you can put some of the above into a new section below this
> >>> http://docs.openvswitch.org/en/latest/topics/dpdk/pmd/#port-rx-queue
> >>> -
> >>> assigment-to-pmd-threads
> >> Sure, I will update that too.
> >>>
> >>> I also think this feature should be experimental until it has been
> >>> road tested a bit more.
> >>>
> >>>> Co-authored-by: Rohith Basavaraja <rohith.basavar...@gmail.com>
> >>>> Co-authored-by: Venkatesan Pradeep
> >>> <venkatesan.prad...@ericsson.com>
> >>>> Signed-off-by: Rohith Basavaraja <rohith.basavar...@gmail.com>
> >>>> Signed-off-by: Venkatesan Pradeep
> <venkatesan.prad...@ericsson.com>
> >>>> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
> >>>> ---
> >>>>  lib/dpif-netdev.c    | 403
> >>> +++++++++++++++++++++++++++++++++++++++++++++++++--
> >>>>  vswitchd/vswitch.xml |  30 ++++
> >>>>  2 files changed, 424 insertions(+), 9 deletions(-)
> >>>>
> >>>
> >>> There seems to be windows style line endings in the patch? or
> >>> something else has gone wrong for me.
> >>>
> >>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> >>>> 1564db9..8db106f 100644
> >>>> --- a/lib/dpif-netdev.c
> >>>> +++ b/lib/dpif-netdev.c
> >>>> @@ -80,6 +80,12 @@
> >>>>
> >>>>  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
> >>>>
> >>>> +/* Auto Load Balancing Defaults */
> >>>> +#define ACCEPT_IMPROVE_DEFAULT       (25)
> >>>> +#define PMD_LOAD_THRE_DEFAULT        (95)
> >>>
> >>> Probably you should remove the brackets above to be consistent with
> >>> the others below and in the rest of the file.
> >>>
> >>>> +#define PMD_REBALANCE_POLL_INTERVAL  1 /* 1 Min */
> >>>> +#define MIN_TO_MSEC                  60000
> >>>> +
> >>>>  #define FLOW_DUMP_MAX_BATCH 50
> >>>>  /* Use per thread recirc_depth to prevent recirculation loop. */
> >>>> #define MAX_RECIRC_DEPTH 6 @@ -288,6 +294,13 @@ struct
> dp_meter {
> >>>>      struct dp_meter_band bands[];
> >>>>  };
> >>>>
> >>>> +struct pmd_auto_lb {
> >>>> +    bool auto_lb_conf;        /* enable-disable auto load balancing */
> >>>
> >>> I'm not sure what '_conf' is short for? maybe it should be something
> >>> like 'auto_lb_requested'
> >> Sure
> >>>
> >>>> +    bool is_enabled;          /* auto_lb current status */
> >>>
> >>> Comments should be of style /* Sentence case. */
> >>> http://docs.openvswitch.org/en/latest/internals/contributing/coding-
> >>> style/#comments
> >>>
> >> Thanks for providing the link. I will update in next version
> >>>
> >>>> +    uint64_t rebalance_intvl;
> >>>> +    uint64_t rebalance_poll_timer; };
> >>>> +
> >>>>  /* Datapath based on the network device interface from netdev.h.
> >>>>   *
> >>>>   *
> >>>> @@ -368,6 +381,7 @@ struct dp_netdev {
> >>>>      uint64_t last_tnl_conf_seq;
> >>>>
> >>>>      struct conntrack conntrack;
> >>>> +    struct pmd_auto_lb pmd_alb;
> >>>>  };
> >>>>
> >>>>  static void meter_lock(const struct dp_netdev *dp, uint32_t
> >>>> meter_id) @@ -682,6 +696,7 @@ struct dp_netdev_pmd_thread {
> >>>>      struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and
> 'tx_ports'. */
> >>>>      /* List of rx queues to poll. */
> >>>>      struct hmap poll_list OVS_GUARDED;
> >>>> +
> >>>
> >>> Unrelated newline should be removed
> >>>
> >>>>      /* Map of 'tx_port's used for transmission.  Written by the main
> thread,
> >>>>       * read by the pmd thread. */
> >>>>      struct hmap tx_ports OVS_GUARDED; @@ -702,6 +717,11 @@ struct
> >>>> dp_netdev_pmd_thread {
> >>>>      /* Keep track of detailed PMD performance statistics. */
> >>>>      struct pmd_perf_stats perf_stats;
> >>>>
> >>>> +    /* Some stats from previous iteration used by automatic pmd
> >>>> +       load balance logic. */
> >>>
> >>> Nit, but see coding stds. and other multi-line comments wrt style
> >>>
> >>>> +    uint64_t prev_stats[PMD_N_STATS];> +    atomic_count
> >>> pmd_overloaded;
> >>>> +
> >>>>      /* Set to true if the pmd thread needs to be reloaded. */
> >>>>      bool need_reload;
> >>>>  };
> >>>> @@ -792,9 +812,11 @@ dp_netdev_rxq_get_cycles(struct
> dp_netdev_rxq
> >>> *rx,
> >>>>                           enum rxq_cycles_counter_type type);
> >>>> static void  dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx,
> >>>> -                           unsigned long long cycles);
> >>>> +                                unsigned long long cycles,
> >>>> +                                unsigned idx);
> >>>>  static uint64_t
> >>>> -dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned
> >>>> idx);
> >>>> +dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx,
> >>>> +                                unsigned idx);
> >>>
> >>> no need to change dp_netdev_rxq_get_intrvl_cycles()
> >>>
> >>>>  static void
> >>>>  dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread
> >>> *pmd,
> >>>>                                 bool purge); @@ -3734,6 +3756,51 @@
> >>>> dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t
> n_ops,
> >>>>      }
> >>>>  }
> >>>>
> >>>> +/* Enable/Disable PMD auto load balancing */ static void
> >>>> +set_pmd_auto_lb(struct dp_netdev *dp) {
> >>>> +    unsigned int cnt = 0;
> >>>> +    struct dp_netdev_pmd_thread *pmd;
> >>>> +    struct pmd_auto_lb * pmd_alb = &dp->pmd_alb;
> >>>> +
> >>>> +    bool enable = false;
> >>>> +    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> >>>> +
> >>>> +    /* Ensure that there is at least 2 non-isolated PMDs and
> >>>> +     * one of the PMD is polling more than one rxq
> >>>> +     */
> >>>> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> >>>> +        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        cnt++;
> >>>> +        if (hmap_count(&pmd->poll_list) > 1) {
> >>>> +            if (enable && (cnt > 1)) {
> >>>> +                break;
> >>>> +            } else {
> >>>> +                enable = true;
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>
> >>> Won't this give the wrong result if there is one pmd with multiple
> >>> rxq's? How about something in the loop like,
> >> Yes, you are right. Thanks for catching this.
> >>>
> >>>         if (hmap_count(&pmd->poll_list) > 1) {
> >>>             multirxq = true;
> >>>         }
> >>>         if (cnt && multirxq) {
> >>>             enable = true;
> >>>             break;
> >>>         }
> >>>         cnt++;
> >>>
> >>>> +    /* Enable auto LB if it is configured and cycle based assignment is
> true */
> >>>> +    enable = enable && pmd_rxq_assign_cyc &&
> >>>> + pmd_alb->auto_lb_conf;
> >>>> +
> >>>> +    if (pmd_alb->is_enabled != enable) {
> >>>> +        pmd_alb->is_enabled = enable;
> >>>> +        if (pmd_alb->is_enabled) {
> >>>> +            VLOG_INFO("PMD auto lb is enabled, rebalance
> intvl:%lu(msec)\n",
> >>>> +                       pmd_alb->rebalance_intvl);
> >>>> +        } else {
> >>>> +            pmd_alb->rebalance_poll_timer = 0;
> >>>> +            VLOG_INFO("PMD auto lb is disabled\n");
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +}
> >>>> +
> >>>>  /* Applies datapath configuration from the database. Some of the
> >>>> changes
> >>> are
> >>>>   * actually applied in dpif_netdev_run(). */  static int @@
> >>>> -3748,6
> >>>> +3815,7 @@ dpif_netdev_set_config(struct dpif *dpif, const struct
> >>>> +smap
> >>> *other_config)
> >>>>                          DEFAULT_EM_FLOW_INSERT_INV_PROB);
> >>>>      uint32_t insert_min, cur_min;
> >>>>      uint32_t tx_flush_interval, cur_tx_flush_interval;
> >>>> +    uint64_t rebalance_intvl;
> >>>>
> >>>>      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
> >>>>                                       DEFAULT_TX_FLUSH_INTERVAL);
> >>>> @@
> >>>> -3819,6 +3887,23 @@ dpif_netdev_set_config(struct dpif *dpif, const
> >>> struct smap *other_config)
> >>>>                    pmd_rxq_assign);
> >>>>          dp_netdev_request_reconfigure(dp);
> >>>>      }
> >>>> +
> >>>> +    struct pmd_auto_lb * pmd_alb = &dp->pmd_alb;
> >>>> +    pmd_alb->auto_lb_conf = smap_get_bool(other_config, "pmd-auto-
> lb",
> >>>> +                              false);
> >>>> +
> >>>> +    rebalance_intvl = smap_get_int(other_config,
> >>>> + "pmd-auto-lb-rebalance-
> >>> intvl",
> >>>> +                              PMD_REBALANCE_POLL_INTERVAL);
> >>>> +
> >>>> +    /* Input is in min, convert it to msec */
> >>>> +    rebalance_intvl =
> >>>> +        rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC :
> >>>> + MIN_TO_MSEC;
> >>>> +
> >>>
> >>> This is creating a default when the user sets 0 - that needs to be
> documented.
> >>>
> >>> With current values, this could overflow rebalance_intvl. The user
> >>> value is in minutes, so suggest to limit user input to some
> >>> reasonable value like 1 week in minutes, and then the min to msec
> >>> can be safe. See tx-flush-interval as an example where the range is
> limited.
> >> Thanks, I will update the documentation.
> >>>
> >>>> +    if (pmd_alb->rebalance_intvl != rebalance_intvl) {
> >>>> +        pmd_alb->rebalance_intvl = rebalance_intvl;
> >>>> +    }
> >>>> +
> >>>> +    set_pmd_auto_lb(dp);
> >>>>      return 0;
> >>>>  }
> >>>>
> >>>> @@ -3974,9 +4059,9 @@ dp_netdev_rxq_get_cycles(struct
> >>> dp_netdev_rxq
> >>>> *rx,
> >>>>
> >>>>  static void
> >>>>  dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx,
> >>>> -                                unsigned long long cycles)
> >>>> +                                unsigned long long cycles,
> >>>> +                                unsigned idx)
> >>>>  {
> >>>> -    unsigned int idx = rx->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX;
> >>>>      atomic_store_relaxed(&rx->cycles_intrvl[idx], cycles);  }
> >>>>
> >>>> @@ -4762,6 +4847,9 @@ reconfigure_datapath(struct dp_netdev *dp)
> >>>>
> >>>>      /* Reload affected pmd threads. */
> >>>>      reload_affected_pmds(dp);
> >>>> +
> >>>> +    /* Check if PMD Auto LB is to be enabled */
> >>>> +    set_pmd_auto_lb(dp);
> >>>>  }
> >>>>
> >>>>  /* Returns true if one of the netdevs in 'dp' requires a
> >>>> reconfiguration */ @@ -4780,6 +4868,228 @@
> >>> ports_require_restart(const struct dp_netdev *dp)
> >>>>      return false;
> >>>>  }
> >>>>
> >>>> +/* Function for calculating variance */ static uint64_t
> >>>> +variance(uint64_t a[], int n) {
> >>>> +    /* Compute mean (average of elements) */
> >>>> +    uint64_t sum = 0;
> >>>> +    uint64_t mean = 0;
> >>>> +    uint64_t sqDiff = 0;
> >>>> +
> >>>> +    if (!n) {
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    for (int i = 0; i < n; i++) {
> >>>> +        sum += a[i];
> >>>> +    }
> >>>> +
> >>>> +    if (sum) {
> >>>> +        mean = sum / n;
> >>>> +
> >>>> +        /* Compute sum squared differences with mean. */
> >>>> +        for (int i = 0; i < n; i++) {
> >>>> +            sqDiff += (a[i] - mean)*(a[i] - mean);
> >>>> +        }
> >>>> +    }
> >>>> +    return (sqDiff ? (sqDiff / n) : 0); }
> >>>> +
> >>>> +static uint64_t
> >>>> +get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
> >>>> +uint32_t num)
> >>>
> >>> I think this function would require the port_mutex.
> >> I will check and add in next version.
> >>>
> >>>> +{
> >>>> +    struct dp_netdev_port *port;
> >>>> +    struct dp_netdev_pmd_thread *pmd;
> >>>> +    struct dp_netdev_rxq ** rxqs = NULL;
> >>>> +    struct rr_numa *numa = NULL;
> >>>> +    struct rr_numa_list rr;
> >>>> +    int n_rxqs = 0;
> >>>> +    uint64_t ret = 0;
> >>>> +    uint64_t *pmd_usage;
> >>>> +
> >>>> +    pmd_usage = xcalloc(num, sizeof(uint64_t));
> >>>> +
> >>>> +    HMAP_FOR_EACH (port, node, &dp->ports) {
> >>>> +        if (!netdev_is_pmd(port->netdev)) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        for (int qid = 0; qid < port->n_rxq; qid++) {
> >>>> +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> >>>> +            uint64_t cycle_hist = 0;
> >>>> +
> >>>> +            if (q->pmd->isolated) {
> >>>> +                continue;
> >>>> +            }
> >>>> +
> >>>> +            if (n_rxqs == 0) {
> >>>> +                rxqs = xmalloc(sizeof *rxqs);
> >>>> +            } else {
> >>>> +                rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
> >>>> +            }
> >>>> +
> >>>> +            /* Sum the queue intervals and store the cycle history. */
> >>>> +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> >>>> +                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> >>>> +            }
> >>>> +            /* Do we need to add intrvl_cycles in history??
> >>>
> >>> If you want to use compare_rxq_cycles() then you have to put them in
> >>> history, but it's only used for that and re-written everytime, so I
> >>> don't think it is harmful.
> >> Yeah, that is the objective of adding it.
> >>>
> >>>> +             * but then we should clear interval cycles also */
> >>>
> >>> I don't think you should be clearing interval cycles in a dry run,
> >>> otherwise they will be reset if the real rebalance occurs.
> >> Thanks for clarifying it. I will remove the comment.
> >>>
> >>>> +            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> >>>> +                                         cycle_hist);
> >>>> +            /* Store the queue. */
> >>>> +            rxqs[n_rxqs++] = q;
> >>>> +        }
> >>>> +    }
> >>>> +    if (n_rxqs > 1) {
> >>>> +        /* Sort the queues in order of the processing cycles
> >>>> +         * they consumed during their last pmd interval. */
> >>>> +        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
> >>>> +    }
> >>>> +    rr_numa_list_populate(dp, &rr);
> >>>> +
> >>>> +    for (int i = 0; i < n_rxqs; i++) {
> >>>> +        int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
> >>>> +        numa = rr_numa_list_lookup(&rr, numa_id);
> >>>> +        if (!numa) {
> >>>> +            /* Don't consider queues across NUMA  ???*/
> >>>
> >>> I think you should abort the whole dry run process if this is
> >>> happening
> >> Okay.
> >>>
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        pmd = rr_numa_get_pmd(numa, true);
> >>>> +        VLOG_DBG("PMD AUTO_LB:Core %d on numa node %d assigned
> >>>> + port
> >>> \'%s\' "
> >>>> +                  "rx queue %d "
> >>>> +                  "(measured processing cycles %"PRIu64").",
> >>>> +                  pmd->core_id, numa_id,
> >>>> +                  netdev_rxq_get_name(rxqs[i]->rx),
> >>>> +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> >>>> +                  dp_netdev_rxq_get_cycles(rxqs[i],
> >>>> + RXQ_CYCLES_PROC_HIST));
> >>>> +
> >>>> +        for (int id = 0; id < num; id++) {
> >>>> +            if (pmd->core_id == core_list[id]) {
> >>>> +                /* Add the processing cycles of rxq to pmd polling it */
> >>>> +                uint64_t proc_cycles = 0;
> 
> >>>> +                for (unsigned idx = 0; idx < PMD_RXQ_INTERVAL_MAX; 
> >>>> idx++)
> {
> >>>> +                    proc_cycles += 
> >>>> dp_netdev_rxq_get_intrvl_cycles(rxqs[i],
> >>>> +                                                                   idx);
> >>>> +                }
> 
> This is ok, but if you can be sure it was already done and result stored in
> PROC_HIST earlier in the function, you could just use
> dp_netdev_rxq_get_cycles(rxqs[i],RXQ_CYCLES_PROC_HIST)
> 
Yeah, I realized it after looking more into the existing code. Will modify it 
in next version. Thanks
> >>>> +                pmd_usage[id] += proc_cycles;
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> >>>> +        uint64_t total_cycles = 0;
> >>>> +
> >>>> +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        /* Get the total pmd cycles for an interval. */
> >>>> +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> >>>> +        /* Estimate the cycles to cover all intervals. */
> >>>> +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> >>>> +        for (int id = 0; id < num; id++) {
> >>>> +            if (pmd->core_id == core_list[id]) {
> >>>> +                if (pmd_usage[id]) {
> >>>> +                    pmd_usage[id] = (pmd_usage[id] * 100) / 
> >>>> total_cycles;
> >>>> +                }
> >>>> +                VLOG_DBG("Core_id:%d, usage:%"PRIu64"\n",
> >>>> +                          pmd->core_id, pmd_usage[id]);
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +    ret = variance(pmd_usage, num);
> >>>> +
> >>>> +    rr_numa_list_destroy(&rr);
> >>>> +    free(rxqs);
> >>>> +    free(pmd_usage);
> >>>> +    return ret;
> >>>> +}
> >>>> +
> >>>> +static bool
> >>>> +pmd_rebalance_dry_run(struct dp_netdev *dp) {
> >>>> +    struct dp_netdev_pmd_thread *pmd;
> >>>> +    uint64_t *curr_pmd_usage;
> >>>> +
> >>>> +    uint64_t curr_variance;
> >>>> +    uint64_t new_variance;
> >>>> +    uint64_t improvement = 0;
> >>>> +    uint32_t num_pmds;
> >>>> +    uint32_t *pmd_corelist;
> >>>> +    struct rxq_poll *poll, *poll_next;
> >>>> +
> >>>> +    num_pmds = cmap_count(&dp->poll_threads);
> >>>> +
> >>>> +    if (num_pmds > 1) {
> >>>> +        curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> >>>> +        pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t));
> >>>> +    } else {
> >>>> +        return false;
> >>>> +    }
> >>>> +
> >>>> +    num_pmds = 0;
> >>>> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> >>>> +        uint64_t total_cycles = 0;
> >>>> +        uint64_t total_proc = 0;
> >>>> +
> >>>> +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        /* Get the total pmd cycles for an interval. */
> >>>> +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> >>>> +        /* Estimate the cycles to cover all intervals. */
> >>>> +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> >>>> +
> >>>> +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
> >>>> +            uint64_t proc_cycles = 0;
> >>>> +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> >>>> +                proc_cycles += 
> >>>> dp_netdev_rxq_get_intrvl_cycles(poll->rxq,
> i);
> >>>> +            }
> >>>> +            total_proc += proc_cycles;
> >>>> +        }
> >>>> +        if (total_proc) {
> >>>> +            curr_pmd_usage[num_pmds] = (total_proc * 100) /
> total_cycles;
> >>>> +        }
> >>>> +
> >>>> +        VLOG_DBG("PMD_AUTO_LB_MON
> curr_pmd_usage(%d):%"PRIu64"",
> >>>> +                  pmd->core_id, curr_pmd_usage[num_pmds]);
> >>>> +
> >>>> +        if (atomic_count_get(&pmd->pmd_overloaded)) {
> >>>> +            atomic_count_set(&pmd->pmd_overloaded, 0);
> >>>> +        }
> >>>> +
> >>>> +        pmd_corelist[num_pmds] = pmd->core_id;
> >>>> +        num_pmds++;
> >>>> +    }
> >>>> +
> >>>> +    curr_variance = variance(curr_pmd_usage, num_pmds);
> >>>> +
> >>>> +    new_variance = get_dry_run_variance(dp, pmd_corelist,
> num_pmds);
> >>>> +    VLOG_DBG("PMD_AUTO_LB_MON new variance: %"PRIu64","
> >>>> +              " curr_variance: %"PRIu64"",
> >>>> +              new_variance, curr_variance);
> >>>> +
> >>>> +    if (new_variance && (new_variance < curr_variance)) {
> >>>> +        improvement =
> >>>> +            ((curr_variance - new_variance) * 100) /
> >>>> + curr_variance;
> >>>> +
> >>>> +        VLOG_DBG("PMD_AUTO_LB_MON improvement %"PRIu64"",
> >>> improvement);
> >>>> +    }
> >>>> +
> >>>> +    free(curr_pmd_usage);
> >>>> +    free(pmd_corelist);
> >>>> +
> >>>> +    if (improvement >= ACCEPT_IMPROVE_DEFAULT) {
> >>>> +        return true;
> >>>> +    }
> >>>> +
> >>>> +    return false;
> >>>> +}
> >>>> +
> >>>> +
> >>>>  /* Return true if needs to revalidate datapath flows. */  static
> >>>> bool dpif_netdev_run(struct dpif *dpif) @@ -4789,6 +5099,9 @@
> >>>> dpif_netdev_run(struct dpif *dpif)
> >>>>      struct dp_netdev_pmd_thread *non_pmd;
> >>>>      uint64_t new_tnl_seq;
> >>>>      bool need_to_flush = true;
> >>>> +    bool pmd_rebalance = false;
> >>>> +    long long int now = time_msec();
> >>>> +    struct dp_netdev_pmd_thread *pmd;
> >>>>
> >>>>      ovs_mutex_lock(&dp->port_mutex);
> >>>>      non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID); @@ -
> 4821,6
> >>>> +5134,38 @@ dpif_netdev_run(struct dpif *dpif)
> >>>>          dp_netdev_pmd_unref(non_pmd);
> >>>>      }
> >>>>
> >>>> +    struct pmd_auto_lb * pmd_alb = &dp->pmd_alb;
> >>>> +    if (pmd_alb->is_enabled) {
> >>>> +        if (!pmd_alb->rebalance_poll_timer) {
> >>>> +            pmd_alb->rebalance_poll_timer = now;
> >>>> +        } else if ((pmd_alb->rebalance_poll_timer +
> >>>> +                   pmd_alb->rebalance_intvl) < now) {
> >>>> +            pmd_alb->rebalance_poll_timer = now;
> >>>> +            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> >>>> +                if (atomic_count_get(&pmd->pmd_overloaded) >=
> >>>> +                                    PMD_RXQ_INTERVAL_MAX) {
> >>>> +                    pmd_rebalance = true;
> >>>> +                    break;
> >>>> +                }
> >>>> +            }
> >>>> +            VLOG_DBG("PMD_AUTO_LB_MON periodic check:pmd
> >>> rebalance:%d",
> >>>> +                      pmd_rebalance);
> >>>> +
> >>>> +            if (pmd_rebalance &&
> >>>> +                !dp_netdev_is_reconf_required(dp) &&
> >>>> +                !ports_require_restart(dp) &&
> >>>> +                pmd_rebalance_dry_run(dp)) {
> >>>
> >>> Don't you need the dp_netdev_mutex for call to
> >>> pmd_rebalance_dry_run, or at least for some parts of it?
> >> I will check it.
> >
> > Actually, I think it's not needed
> >
> >>>
> >>>> +
> >>>> +                ovs_mutex_unlock(&dp->port_mutex);
> >>>
> >>> It seems odd to be unlocking this and then taking it again, is there a
> reason?
> >> Need to check it again if we can have both locks at the same time.
> >>>
> >>>> +                ovs_mutex_lock(&dp_netdev_mutex);
> >
> > I don't think you need this lock or to unlock the port_mutex
> >
> >>>> +                VLOG_DBG("PMD_AUTO_LB_MON Invoking PMD
> RECONFIGURE");
> >>>> +                dp_netdev_request_reconfigure(dp);
> >>>> +                ovs_mutex_unlock(&dp_netdev_mutex);
> >>>> +                ovs_mutex_lock(&dp->port_mutex);
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>>      if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp))
> {
> >>>>          reconfigure_datapath(dp);
> >>>>      }
> >>>> @@ -4979,6 +5324,8 @@ pmd_thread_main(void *f_)
> >>>>  reload:
> >>>>      pmd_alloc_static_tx_qid(pmd);
> >>>>
> >>>> +    atomic_count_init(&pmd->pmd_overloaded, 0);
> >>>> +
> >>>>      /* List port/core affinity */
> >>>>      for (i = 0; i < poll_cnt; i++) {
> >>>>         VLOG_DBG("Core %d processing port \'%s\' with queue-id
> >>>> %d\n", @@ -4986,6 +5333,10 @@ reload:
> >>>>                  netdev_rxq_get_queue_id(poll_list[i].rxq->rx));
> >>>>         /* Reset the rxq current cycles counter. */
> >>>>         dp_netdev_rxq_set_cycles(poll_list[i].rxq,
> >>>> RXQ_CYCLES_PROC_CURR, 0);
> >>>> +
> >>>> +       for (unsigned j = 0; j < PMD_RXQ_INTERVAL_MAX; j++) {
> >>>> +            dp_netdev_rxq_set_intrvl_cycles(poll_list[i].rxq, 0, j);
> >>>> +       }
> >>>
> >>> is it needed for this patch? won't all the values have been
> >>> refreshed by the time the next check is performed anyway
> >> I thought it is safe to reset it. If PMD is reset in middle of cycle then 
> >> it may
> have stale information when dry_run is executed.
> >
> > It shouldn't be reset as it can clear info for some rxqs before
> > rxq-pmd assignment when ports are reconfigured. You can see this in
> > the rxq-pmd assignment logs, when adding/removing rxqs.
> >
> >>>
> >>>>      }
> >>>>
> >>>>      if (!poll_cnt) {
> >>>> @@ -7188,17 +7539,51 @@ dp_netdev_pmd_try_optimize(struct
> >>> dp_netdev_pmd_thread *pmd,
> >>>>                             struct polled_queue *poll_list, int
> >>>> poll_cnt)  {
> >>>>      struct dpcls *cls;
> >>>> +    uint64_t tot_idle = 0, tot_proc = 0;
> >>>> +    unsigned int idx;
> >>>> +    unsigned int pmd_load = 0;
> >>>>
> >>>>      if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
> >>>>          uint64_t curr_tsc;
> >>>> +        struct pmd_auto_lb * pmd_alb = &pmd->dp->pmd_alb;
> >>>> +        if (pmd_alb->is_enabled && !pmd->isolated) {
> >>>> +            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE]
> -
> >>>> +                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> >>>> +            tot_proc = pmd-
> >perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> >>>> +                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> >>>> +
> >>>> +            if (tot_proc) {
> >>>> +                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> >>>> +            }
> >>>> +
> >>>> +            if (pmd_load >= PMD_LOAD_THRE_DEFAULT) {
> >>>> +                atomic_count_inc(&pmd->pmd_overloaded);
> >>>> +
> >>>> +                VLOG_DBG("PMD_AUTO_LB_MON PMD OVERLOAD DETECT
> iter
> >>> %d",
> >>>> +                          atomic_count_get(&pmd->pmd_overloaded));
> >>>
> >>> Better to remove this log
> >> Sure
> >>>
> >>>> +            } else {
> >>>> +                atomic_count_set(&pmd->pmd_overloaded, 0);
> >>>> +            }
> >>>> +        }
> >>>> +
> >>>> +        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
> >>>> +                        
> >>>> pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
> >>>> +        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
> >>>> +
> >>>> + pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> >>>> +
> >>>
> >>> These have been used earlier - are they initialized somewhere for
> >>> the first use? you could just skip the above if() on the first call,
> >>> so they get initialized, or else init them at the start of
> >>> pmd_thread_main()
> >> It would have in initialized to 0 when pmd structure is created. We are not
> resetting it to 0 whenever PMD is reloaded.
> >
> > but isn't the counters.n structure reset everytime the pmd reloads? so
> > it would mean that this should be reset also. Maybe I read the
> > pmd_stats code wrong.
> >
> >>>
> >>>>          /* Get the cycles that were used to process each queue and 
> >>>> store.
> */
> >>>>          for (unsigned i = 0; i < poll_cnt; i++) {
> >>>> -            uint64_t rxq_cyc_curr =
> dp_netdev_rxq_get_cycles(poll_list[i].rxq,
> >>>> -                                                        
> >>>> RXQ_CYCLES_PROC_CURR);
> >>>> -            dp_netdev_rxq_set_intrvl_cycles(poll_list[i].rxq, 
> >>>> rxq_cyc_curr);
> >>>> -            dp_netdev_rxq_set_cycles(poll_list[i].rxq,
> >>> RXQ_CYCLES_PROC_CURR,
> >>>> -                                     0);
> >>>> +            uint64_t rxq_cyc_curr;
> >>>> +            struct dp_netdev_rxq *rxq;
> >>>> +
> >>>> +            rxq = poll_list[i].rxq;
> >>>> +            idx = rxq->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX;
> >>>> +
> >>>> +            rxq_cyc_curr = dp_netdev_rxq_get_cycles(rxq,
> >>> RXQ_CYCLES_PROC_CURR);
> >>>> +            dp_netdev_rxq_set_intrvl_cycles(rxq, rxq_cyc_curr, idx);
> >>>> +            dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_PROC_CURR,
> >>>> + 0);
> >>>>          }
> >>>> +
> >>>>          curr_tsc = cycles_counter_update(&pmd->perf_stats);
> >>>>          if (pmd->intrvl_tsc_prev) {
> >>>>              /* There is a prev timestamp, store a new intrvl cycle
> >>>> count. */ diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> >>>> index 2160910..ff3589c 100644
> >>>> --- a/vswitchd/vswitch.xml
> >>>> +++ b/vswitchd/vswitch.xml
> >>>> @@ -574,6 +574,36 @@
> >>>>              be set to 'skip_sw'.
> >>>>          </p>
> >>>>        </column>
> >>>> +      <column name="other_config" key="pmd-auto-lb"
> >>>> +              type='{"type": "boolean"}'>
> >>>> +        <p>
> >>>> +         Configures PMD Auto Load Balancing that allows automatic
> >>> assignment of
> >>>> +         RX queues to PMDs if any of PMDs is overloaded (i.e.
> >>>> + processing
> >>> cycles
> >>>> +         > 95%).
> >>>> +        </p>
> >>>> +        <p>
> >>>> +         It uses current scheme of cycle based assignment of RX queues
> that
> >>>> +         are not statically pinned to PMDs.
> >>>> +        </p>
> >>>> +        <p>
> >>>> +          Set this value to <code>true</code> to enable this option.
> >>>> +        </p>
> >>>> +        <p>
> >>>> +          The default value is <code>false</code>.
> >>>> +        </p>
> >>>> +        <p>
> >>>> +         This only comes in effect if cycle based assignment is enabled 
> >>>> and
> >>>> +         there are more than one non-isolated PMDs present and
> >>>> + atleast one
> >>> of
> >>>> +         it polls more than one queue.
> >>>> +        </p>
> >>>> +      </column>
> >>>> +      <column name="other_config" key="pmd-auto-lb-rebalance-intvl"
> >>>> +              type='{"type": "integer", "minInteger": 1}'>
> >>>> +        <p>
> >>>> +         The minimum time (in minutes) 2 consecutive PMD Auto Load
> >>> Balancing
> >>>> +         iterations.
> >>>> +        </p>
> >>>> +      </column>
> >>>>      </group>
> >>>>      <group title="Status">
> >>>>        <column name="next_cfg">
> >>>>
> >>
> >


_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v2] Adding support for PMD auto load balancing

Reply via email to