On 01/04/2019 02:56 PM, Kevin Traynor wrote: > On 01/04/2019 12:39 PM, Nitin Katiyar wrote: >> >> >>> -----Original Message----- >>> From: Kevin Traynor [mailto:[email protected]] >>> Sent: Friday, January 04, 2019 1:48 AM >>> To: Nitin Katiyar <[email protected]>; [email protected] >>> Cc: Rohith Basavaraja <[email protected]>; Venkatesan Pradeep >>> <[email protected]> >>> Subject: Re: [PATCH v2] Adding support for PMD auto load balancing >>> >>> On 01/03/2019 12:36 PM, Nitin Katiyar wrote: >>>> Port rx queues that have not been statically assigned to PMDs are >>>> currently assigned based on periodically sampled load measurements. >>>> The assignment is performed at specific instances – port addition, >>>> port deletion, upon reassignment request via CLI etc. >>>> >>>> Due to change in traffic pattern over time it can cause uneven load >>>> among the PMDs and thus resulting in lower overall throughout. >>>> >>>> This patch enables the support of auto load balancing of PMDs based on >>>> measured load of RX queues. Each PMD measures the processing load for >>>> each of its associated queues every 10 seconds. If the aggregated PMD >>>> load reaches 95% for 6 consecutive intervals then PMD considers itself to >>> be overloaded. >>>> >>>> If any PMD is overloaded, a dry-run of the PMD assignment algorithm is >>>> performed by OVS main thread. The dry-run does NOT change the existing >>>> queue to PMD assignments. >>>> >>>> If the resultant mapping of dry-run indicates an improved distribution >>>> of the load then the actual reassignment will be performed. >>>> >>>> The automatic rebalancing will be disabled by default and has to be >>>> enabled via configuration option. The interval (in minutes) between >>>> two consecutive rebalancing can also be configured via CLI, default is >>>> 1 min. >>>> >>>> Following example commands can be used to set the auto-lb params: >>>> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true" >>>> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5" >>>> >>> >>> Hi Nitin, thanks for v2. Not full review yet but sending some comments >>> below. >>>
Additional minor comment below, thanks. >>> Maybe you can put some of the above into a new section below this >>> http://docs.openvswitch.org/en/latest/topics/dpdk/pmd/#port-rx-queue- >>> assigment-to-pmd-threads >> Sure, I will update that too. >>> >>> I also think this feature should be experimental until it has been road >>> tested a >>> bit more. >>> >>>> Co-authored-by: Rohith Basavaraja <[email protected]> >>>> Co-authored-by: Venkatesan Pradeep >>> <[email protected]> >>>> Signed-off-by: Rohith Basavaraja <[email protected]> >>>> Signed-off-by: Venkatesan Pradeep <[email protected]> >>>> Signed-off-by: Nitin Katiyar <[email protected]> >>>> --- >>>> lib/dpif-netdev.c | 403 >>> +++++++++++++++++++++++++++++++++++++++++++++++++-- >>>> vswitchd/vswitch.xml | 30 ++++ >>>> 2 files changed, 424 insertions(+), 9 deletions(-) >>>> >>> >>> There seems to be windows style line endings in the patch? or something else >>> has gone wrong for me. >>> >>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index >>>> 1564db9..8db106f 100644 >>>> --- a/lib/dpif-netdev.c >>>> +++ b/lib/dpif-netdev.c >>>> @@ -80,6 +80,12 @@ >>>> >>>> VLOG_DEFINE_THIS_MODULE(dpif_netdev); >>>> >>>> +/* Auto Load Balancing Defaults */ >>>> +#define ACCEPT_IMPROVE_DEFAULT (25) >>>> +#define PMD_LOAD_THRE_DEFAULT (95) >>> >>> Probably you should remove the brackets above to be consistent with the >>> others below and in the rest of the file. >>> >>>> +#define PMD_REBALANCE_POLL_INTERVAL 1 /* 1 Min */ >>>> +#define MIN_TO_MSEC 60000 >>>> + >>>> #define FLOW_DUMP_MAX_BATCH 50 >>>> /* Use per thread recirc_depth to prevent recirculation loop. */ >>>> #define MAX_RECIRC_DEPTH 6 @@ -288,6 +294,13 @@ struct dp_meter { >>>> struct dp_meter_band bands[]; >>>> }; >>>> >>>> +struct pmd_auto_lb { >>>> + bool auto_lb_conf; /* enable-disable auto load balancing */ >>> >>> I'm not sure what '_conf' is short for? maybe it should be something like >>> 'auto_lb_requested' >> Sure >>> >>>> + bool is_enabled; /* auto_lb current status */ >>> >>> Comments should be of style /* Sentence case. */ >>> http://docs.openvswitch.org/en/latest/internals/contributing/coding- >>> style/#comments >>> >> Thanks for providing the link. I will update in next version >>> >>>> + uint64_t rebalance_intvl; >>>> + uint64_t rebalance_poll_timer; >>>> +}; >>>> + >>>> /* Datapath based on the network device interface from netdev.h. >>>> * >>>> * >>>> @@ -368,6 +381,7 @@ struct dp_netdev { >>>> uint64_t last_tnl_conf_seq; >>>> >>>> struct conntrack conntrack; >>>> + struct pmd_auto_lb pmd_alb; >>>> }; >>>> >>>> static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id) >>>> @@ -682,6 +696,7 @@ struct dp_netdev_pmd_thread { >>>> struct ovs_mutex port_mutex; /* Mutex for 'poll_list' and >>>> 'tx_ports'. */ >>>> /* List of rx queues to poll. */ >>>> struct hmap poll_list OVS_GUARDED; >>>> + >>> >>> Unrelated newline should be removed >>> >>>> /* Map of 'tx_port's used for transmission. Written by the main >>>> thread, >>>> * read by the pmd thread. */ >>>> struct hmap tx_ports OVS_GUARDED; @@ -702,6 +717,11 @@ struct >>>> dp_netdev_pmd_thread { >>>> /* Keep track of detailed PMD performance statistics. */ >>>> struct pmd_perf_stats perf_stats; >>>> >>>> + /* Some stats from previous iteration used by automatic pmd >>>> + load balance logic. */ >>> >>> Nit, but see coding stds. and other multi-line comments wrt style >>> >>>> + uint64_t prev_stats[PMD_N_STATS];> + atomic_count >>> pmd_overloaded; >>>> + >>>> /* Set to true if the pmd thread needs to be reloaded. */ >>>> bool need_reload; >>>> }; >>>> @@ -792,9 +812,11 @@ dp_netdev_rxq_get_cycles(struct dp_netdev_rxq >>> *rx, >>>> enum rxq_cycles_counter_type type); static >>>> void dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx, >>>> - unsigned long long cycles); >>>> + unsigned long long cycles, >>>> + unsigned idx); >>>> static uint64_t >>>> -dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned >>>> idx); >>>> +dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, >>>> + unsigned idx); >>> >>> no need to change dp_netdev_rxq_get_intrvl_cycles() >>> >>>> static void >>>> dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread >>> *pmd, >>>> bool purge); @@ -3734,6 +3756,51 @@ >>>> dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops, >>>> } >>>> } >>>> >>>> +/* Enable/Disable PMD auto load balancing */ static void >>>> +set_pmd_auto_lb(struct dp_netdev *dp) { >>>> + unsigned int cnt = 0; >>>> + struct dp_netdev_pmd_thread *pmd; >>>> + struct pmd_auto_lb * pmd_alb = &dp->pmd_alb; >>>> + >>>> + bool enable = false; >>>> + bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc; >>>> + >>>> + /* Ensure that there is at least 2 non-isolated PMDs and >>>> + * one of the PMD is polling more than one rxq >>>> + */ >>>> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { >>>> + if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) { >>>> + continue; >>>> + } >>>> + >>>> + cnt++; >>>> + if (hmap_count(&pmd->poll_list) > 1) { >>>> + if (enable && (cnt > 1)) { >>>> + break; >>>> + } else { >>>> + enable = true; >>>> + } >>>> + } >>>> + } >>>> + >>> >>> Won't this give the wrong result if there is one pmd with multiple rxq's? >>> How >>> about something in the loop like, >> Yes, you are right. Thanks for catching this. >>> >>> if (hmap_count(&pmd->poll_list) > 1) { >>> multirxq = true; >>> } >>> if (cnt && multirxq) { >>> enable = true; >>> break; >>> } >>> cnt++; >>> >>>> + /* Enable auto LB if it is configured and cycle based assignment is >>>> true */ >>>> + enable = enable && pmd_rxq_assign_cyc && pmd_alb->auto_lb_conf; >>>> + >>>> + if (pmd_alb->is_enabled != enable) { >>>> + pmd_alb->is_enabled = enable; >>>> + if (pmd_alb->is_enabled) { >>>> + VLOG_INFO("PMD auto lb is enabled, rebalance >>>> intvl:%lu(msec)\n", >>>> + pmd_alb->rebalance_intvl); >>>> + } else { >>>> + pmd_alb->rebalance_poll_timer = 0; >>>> + VLOG_INFO("PMD auto lb is disabled\n"); >>>> + } >>>> + } >>>> + >>>> +} >>>> + >>>> /* Applies datapath configuration from the database. Some of the changes >>> are >>>> * actually applied in dpif_netdev_run(). */ static int @@ -3748,6 >>>> +3815,7 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap >>> *other_config) >>>> DEFAULT_EM_FLOW_INSERT_INV_PROB); >>>> uint32_t insert_min, cur_min; >>>> uint32_t tx_flush_interval, cur_tx_flush_interval; >>>> + uint64_t rebalance_intvl; >>>> >>>> tx_flush_interval = smap_get_int(other_config, "tx-flush-interval", >>>> DEFAULT_TX_FLUSH_INTERVAL); @@ >>>> -3819,6 +3887,23 @@ dpif_netdev_set_config(struct dpif *dpif, const >>> struct smap *other_config) >>>> pmd_rxq_assign); >>>> dp_netdev_request_reconfigure(dp); >>>> } >>>> + >>>> + struct pmd_auto_lb * pmd_alb = &dp->pmd_alb; >>>> + pmd_alb->auto_lb_conf = smap_get_bool(other_config, "pmd-auto-lb", >>>> + false); >>>> + >>>> + rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebalance- >>> intvl", >>>> + PMD_REBALANCE_POLL_INTERVAL); >>>> + >>>> + /* Input is in min, convert it to msec */ >>>> + rebalance_intvl = >>>> + rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC : >>>> + MIN_TO_MSEC; >>>> + >>> >>> This is creating a default when the user sets 0 - that needs to be >>> documented. >>> >>> With current values, this could overflow rebalance_intvl. The user value is >>> in >>> minutes, so suggest to limit user input to some reasonable value like 1 >>> week in >>> minutes, and then the min to msec can be safe. See tx-flush-interval as an >>> example where the range is limited. >> Thanks, I will update the documentation. >>> >>>> + if (pmd_alb->rebalance_intvl != rebalance_intvl) { >>>> + pmd_alb->rebalance_intvl = rebalance_intvl; >>>> + } >>>> + >>>> + set_pmd_auto_lb(dp); >>>> return 0; >>>> } >>>> >>>> @@ -3974,9 +4059,9 @@ dp_netdev_rxq_get_cycles(struct >>> dp_netdev_rxq >>>> *rx, >>>> >>>> static void >>>> dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx, >>>> - unsigned long long cycles) >>>> + unsigned long long cycles, >>>> + unsigned idx) >>>> { >>>> - unsigned int idx = rx->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX; >>>> atomic_store_relaxed(&rx->cycles_intrvl[idx], cycles); } >>>> >>>> @@ -4762,6 +4847,9 @@ reconfigure_datapath(struct dp_netdev *dp) >>>> >>>> /* Reload affected pmd threads. */ >>>> reload_affected_pmds(dp); >>>> + >>>> + /* Check if PMD Auto LB is to be enabled */ >>>> + set_pmd_auto_lb(dp); >>>> } >>>> >>>> /* Returns true if one of the netdevs in 'dp' requires a >>>> reconfiguration */ @@ -4780,6 +4868,228 @@ >>> ports_require_restart(const struct dp_netdev *dp) >>>> return false; >>>> } >>>> >>>> +/* Function for calculating variance */ static uint64_t >>>> +variance(uint64_t a[], int n) { >>>> + /* Compute mean (average of elements) */ >>>> + uint64_t sum = 0; >>>> + uint64_t mean = 0; >>>> + uint64_t sqDiff = 0; >>>> + >>>> + if (!n) { >>>> + return 0; >>>> + } >>>> + >>>> + for (int i = 0; i < n; i++) { >>>> + sum += a[i]; >>>> + } >>>> + >>>> + if (sum) { >>>> + mean = sum / n; >>>> + >>>> + /* Compute sum squared differences with mean. */ >>>> + for (int i = 0; i < n; i++) { >>>> + sqDiff += (a[i] - mean)*(a[i] - mean); >>>> + } >>>> + } >>>> + return (sqDiff ? (sqDiff / n) : 0); } >>>> + >>>> +static uint64_t >>>> +get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list, >>>> +uint32_t num) >>> >>> I think this function would require the port_mutex. >> I will check and add in next version. >>> >>>> +{ >>>> + struct dp_netdev_port *port; >>>> + struct dp_netdev_pmd_thread *pmd; >>>> + struct dp_netdev_rxq ** rxqs = NULL; >>>> + struct rr_numa *numa = NULL; >>>> + struct rr_numa_list rr; >>>> + int n_rxqs = 0; >>>> + uint64_t ret = 0; >>>> + uint64_t *pmd_usage; >>>> + >>>> + pmd_usage = xcalloc(num, sizeof(uint64_t)); >>>> + >>>> + HMAP_FOR_EACH (port, node, &dp->ports) { >>>> + if (!netdev_is_pmd(port->netdev)) { >>>> + continue; >>>> + } >>>> + >>>> + for (int qid = 0; qid < port->n_rxq; qid++) { >>>> + struct dp_netdev_rxq *q = &port->rxqs[qid]; >>>> + uint64_t cycle_hist = 0; >>>> + >>>> + if (q->pmd->isolated) { >>>> + continue; >>>> + } >>>> + >>>> + if (n_rxqs == 0) { >>>> + rxqs = xmalloc(sizeof *rxqs); >>>> + } else { >>>> + rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1)); >>>> + } >>>> + >>>> + /* Sum the queue intervals and store the cycle history. */ >>>> + for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { >>>> + cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); >>>> + } >>>> + /* Do we need to add intrvl_cycles in history?? >>> >>> If you want to use compare_rxq_cycles() then you have to put them in >>> history, >>> but it's only used for that and re-written everytime, so I don't think it is >>> harmful. >> Yeah, that is the objective of adding it. >>> >>>> + * but then we should clear interval cycles also */ >>> >>> I don't think you should be clearing interval cycles in a dry run, >>> otherwise they >>> will be reset if the real rebalance occurs. >> Thanks for clarifying it. I will remove the comment. >>> >>>> + dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, >>>> + cycle_hist); >>>> + /* Store the queue. */ >>>> + rxqs[n_rxqs++] = q; >>>> + } >>>> + } >>>> + if (n_rxqs > 1) { >>>> + /* Sort the queues in order of the processing cycles >>>> + * they consumed during their last pmd interval. */ >>>> + qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles); >>>> + } >>>> + rr_numa_list_populate(dp, &rr); >>>> + >>>> + for (int i = 0; i < n_rxqs; i++) { >>>> + int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev); >>>> + numa = rr_numa_list_lookup(&rr, numa_id); >>>> + if (!numa) { >>>> + /* Don't consider queues across NUMA ???*/ >>> >>> I think you should abort the whole dry run process if this is happening >> Okay. >>> >>>> + continue; >>>> + } >>>> + >>>> + pmd = rr_numa_get_pmd(numa, true); >>>> + VLOG_DBG("PMD AUTO_LB:Core %d on numa node %d assigned port >>> \'%s\' " >>>> + "rx queue %d " >>>> + "(measured processing cycles %"PRIu64").", >>>> + pmd->core_id, numa_id, >>>> + netdev_rxq_get_name(rxqs[i]->rx), >>>> + netdev_rxq_get_queue_id(rxqs[i]->rx), >>>> + dp_netdev_rxq_get_cycles(rxqs[i], >>>> + RXQ_CYCLES_PROC_HIST)); >>>> + >>>> + for (int id = 0; id < num; id++) { >>>> + if (pmd->core_id == core_list[id]) { >>>> + /* Add the processing cycles of rxq to pmd polling it */ >>>> + uint64_t proc_cycles = 0; >>>> + for (unsigned idx = 0; idx < PMD_RXQ_INTERVAL_MAX; idx++) >>>> { >>>> + proc_cycles += >>>> dp_netdev_rxq_get_intrvl_cycles(rxqs[i], >>>> + idx); >>>> + } This is ok, but if you can be sure it was already done and result stored in PROC_HIST earlier in the function, you could just use dp_netdev_rxq_get_cycles(rxqs[i],RXQ_CYCLES_PROC_HIST) >>>> + pmd_usage[id] += proc_cycles; >>>> + } >>>> + } >>>> + } >>>> + >>>> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { >>>> + uint64_t total_cycles = 0; >>>> + >>>> + if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) { >>>> + continue; >>>> + } >>>> + >>>> + /* Get the total pmd cycles for an interval. */ >>>> + atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); >>>> + /* Estimate the cycles to cover all intervals. */ >>>> + total_cycles *= PMD_RXQ_INTERVAL_MAX; >>>> + for (int id = 0; id < num; id++) { >>>> + if (pmd->core_id == core_list[id]) { >>>> + if (pmd_usage[id]) { >>>> + pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles; >>>> + } >>>> + VLOG_DBG("Core_id:%d, usage:%"PRIu64"\n", >>>> + pmd->core_id, pmd_usage[id]); >>>> + } >>>> + } >>>> + } >>>> + ret = variance(pmd_usage, num); >>>> + >>>> + rr_numa_list_destroy(&rr); >>>> + free(rxqs); >>>> + free(pmd_usage); >>>> + return ret; >>>> +} >>>> + >>>> +static bool >>>> +pmd_rebalance_dry_run(struct dp_netdev *dp) { >>>> + struct dp_netdev_pmd_thread *pmd; >>>> + uint64_t *curr_pmd_usage; >>>> + >>>> + uint64_t curr_variance; >>>> + uint64_t new_variance; >>>> + uint64_t improvement = 0; >>>> + uint32_t num_pmds; >>>> + uint32_t *pmd_corelist; >>>> + struct rxq_poll *poll, *poll_next; >>>> + >>>> + num_pmds = cmap_count(&dp->poll_threads); >>>> + >>>> + if (num_pmds > 1) { >>>> + curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t)); >>>> + pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t)); >>>> + } else { >>>> + return false; >>>> + } >>>> + >>>> + num_pmds = 0; >>>> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { >>>> + uint64_t total_cycles = 0; >>>> + uint64_t total_proc = 0; >>>> + >>>> + if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) { >>>> + continue; >>>> + } >>>> + >>>> + /* Get the total pmd cycles for an interval. */ >>>> + atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); >>>> + /* Estimate the cycles to cover all intervals. */ >>>> + total_cycles *= PMD_RXQ_INTERVAL_MAX; >>>> + >>>> + HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) { >>>> + uint64_t proc_cycles = 0; >>>> + for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { >>>> + proc_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, >>>> i); >>>> + } >>>> + total_proc += proc_cycles; >>>> + } >>>> + if (total_proc) { >>>> + curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles; >>>> + } >>>> + >>>> + VLOG_DBG("PMD_AUTO_LB_MON curr_pmd_usage(%d):%"PRIu64"", >>>> + pmd->core_id, curr_pmd_usage[num_pmds]); >>>> + >>>> + if (atomic_count_get(&pmd->pmd_overloaded)) { >>>> + atomic_count_set(&pmd->pmd_overloaded, 0); >>>> + } >>>> + >>>> + pmd_corelist[num_pmds] = pmd->core_id; >>>> + num_pmds++; >>>> + } >>>> + >>>> + curr_variance = variance(curr_pmd_usage, num_pmds); >>>> + >>>> + new_variance = get_dry_run_variance(dp, pmd_corelist, num_pmds); >>>> + VLOG_DBG("PMD_AUTO_LB_MON new variance: %"PRIu64"," >>>> + " curr_variance: %"PRIu64"", >>>> + new_variance, curr_variance); >>>> + >>>> + if (new_variance && (new_variance < curr_variance)) { >>>> + improvement = >>>> + ((curr_variance - new_variance) * 100) / curr_variance; >>>> + >>>> + VLOG_DBG("PMD_AUTO_LB_MON improvement %"PRIu64"", >>> improvement); >>>> + } >>>> + >>>> + free(curr_pmd_usage); >>>> + free(pmd_corelist); >>>> + >>>> + if (improvement >= ACCEPT_IMPROVE_DEFAULT) { >>>> + return true; >>>> + } >>>> + >>>> + return false; >>>> +} >>>> + >>>> + >>>> /* Return true if needs to revalidate datapath flows. */ static bool >>>> dpif_netdev_run(struct dpif *dpif) @@ -4789,6 +5099,9 @@ >>>> dpif_netdev_run(struct dpif *dpif) >>>> struct dp_netdev_pmd_thread *non_pmd; >>>> uint64_t new_tnl_seq; >>>> bool need_to_flush = true; >>>> + bool pmd_rebalance = false; >>>> + long long int now = time_msec(); >>>> + struct dp_netdev_pmd_thread *pmd; >>>> >>>> ovs_mutex_lock(&dp->port_mutex); >>>> non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID); @@ -4821,6 >>>> +5134,38 @@ dpif_netdev_run(struct dpif *dpif) >>>> dp_netdev_pmd_unref(non_pmd); >>>> } >>>> >>>> + struct pmd_auto_lb * pmd_alb = &dp->pmd_alb; >>>> + if (pmd_alb->is_enabled) { >>>> + if (!pmd_alb->rebalance_poll_timer) { >>>> + pmd_alb->rebalance_poll_timer = now; >>>> + } else if ((pmd_alb->rebalance_poll_timer + >>>> + pmd_alb->rebalance_intvl) < now) { >>>> + pmd_alb->rebalance_poll_timer = now; >>>> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { >>>> + if (atomic_count_get(&pmd->pmd_overloaded) >= >>>> + PMD_RXQ_INTERVAL_MAX) { >>>> + pmd_rebalance = true; >>>> + break; >>>> + } >>>> + } >>>> + VLOG_DBG("PMD_AUTO_LB_MON periodic check:pmd >>> rebalance:%d", >>>> + pmd_rebalance); >>>> + >>>> + if (pmd_rebalance && >>>> + !dp_netdev_is_reconf_required(dp) && >>>> + !ports_require_restart(dp) && >>>> + pmd_rebalance_dry_run(dp)) { >>> >>> Don't you need the dp_netdev_mutex for call to pmd_rebalance_dry_run, or >>> at least for some parts of it? >> I will check it. > > Actually, I think it's not needed > >>> >>>> + >>>> + ovs_mutex_unlock(&dp->port_mutex); >>> >>> It seems odd to be unlocking this and then taking it again, is there a >>> reason? >> Need to check it again if we can have both locks at the same time. >>> >>>> + ovs_mutex_lock(&dp_netdev_mutex); > > I don't think you need this lock or to unlock the port_mutex > >>>> + VLOG_DBG("PMD_AUTO_LB_MON Invoking PMD RECONFIGURE"); >>>> + dp_netdev_request_reconfigure(dp); >>>> + ovs_mutex_unlock(&dp_netdev_mutex); >>>> + ovs_mutex_lock(&dp->port_mutex); >>>> + } >>>> + } >>>> + } >>>> + >>>> if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) { >>>> reconfigure_datapath(dp); >>>> } >>>> @@ -4979,6 +5324,8 @@ pmd_thread_main(void *f_) >>>> reload: >>>> pmd_alloc_static_tx_qid(pmd); >>>> >>>> + atomic_count_init(&pmd->pmd_overloaded, 0); >>>> + >>>> /* List port/core affinity */ >>>> for (i = 0; i < poll_cnt; i++) { >>>> VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n", >>>> @@ -4986,6 +5333,10 @@ reload: >>>> netdev_rxq_get_queue_id(poll_list[i].rxq->rx)); >>>> /* Reset the rxq current cycles counter. */ >>>> dp_netdev_rxq_set_cycles(poll_list[i].rxq, >>>> RXQ_CYCLES_PROC_CURR, 0); >>>> + >>>> + for (unsigned j = 0; j < PMD_RXQ_INTERVAL_MAX; j++) { >>>> + dp_netdev_rxq_set_intrvl_cycles(poll_list[i].rxq, 0, j); >>>> + } >>> >>> is it needed for this patch? won't all the values have been refreshed by the >>> time the next check is performed anyway >> I thought it is safe to reset it. If PMD is reset in middle of cycle then it >> may have stale information when dry_run is executed. > > It shouldn't be reset as it can clear info for some rxqs before rxq-pmd > assignment when ports are reconfigured. You can see this in the rxq-pmd > assignment logs, when adding/removing rxqs. > >>> >>>> } >>>> >>>> if (!poll_cnt) { >>>> @@ -7188,17 +7539,51 @@ dp_netdev_pmd_try_optimize(struct >>> dp_netdev_pmd_thread *pmd, >>>> struct polled_queue *poll_list, int >>>> poll_cnt) { >>>> struct dpcls *cls; >>>> + uint64_t tot_idle = 0, tot_proc = 0; >>>> + unsigned int idx; >>>> + unsigned int pmd_load = 0; >>>> >>>> if (pmd->ctx.now > pmd->rxq_next_cycle_store) { >>>> uint64_t curr_tsc; >>>> + struct pmd_auto_lb * pmd_alb = &pmd->dp->pmd_alb; >>>> + if (pmd_alb->is_enabled && !pmd->isolated) { >>>> + tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] - >>>> + pmd->prev_stats[PMD_CYCLES_ITER_IDLE]; >>>> + tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] - >>>> + pmd->prev_stats[PMD_CYCLES_ITER_BUSY]; >>>> + >>>> + if (tot_proc) { >>>> + pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc)); >>>> + } >>>> + >>>> + if (pmd_load >= PMD_LOAD_THRE_DEFAULT) { >>>> + atomic_count_inc(&pmd->pmd_overloaded); >>>> + >>>> + VLOG_DBG("PMD_AUTO_LB_MON PMD OVERLOAD DETECT iter >>> %d", >>>> + atomic_count_get(&pmd->pmd_overloaded)); >>> >>> Better to remove this log >> Sure >>> >>>> + } else { >>>> + atomic_count_set(&pmd->pmd_overloaded, 0); >>>> + } >>>> + } >>>> + >>>> + pmd->prev_stats[PMD_CYCLES_ITER_IDLE] = >>>> + pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE]; >>>> + pmd->prev_stats[PMD_CYCLES_ITER_BUSY] = >>>> + >>>> + pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY]; >>>> + >>> >>> These have been used earlier - are they initialized somewhere for the first >>> use? you could just skip the above if() on the first call, so they get >>> initialized, >>> or else init them at the start of pmd_thread_main() >> It would have in initialized to 0 when pmd structure is created. We are not >> resetting it to 0 whenever PMD is reloaded. > > but isn't the counters.n structure reset everytime the pmd reloads? so > it would mean that this should be reset also. Maybe I read the pmd_stats > code wrong. > >>> >>>> /* Get the cycles that were used to process each queue and store. >>>> */ >>>> for (unsigned i = 0; i < poll_cnt; i++) { >>>> - uint64_t rxq_cyc_curr = >>>> dp_netdev_rxq_get_cycles(poll_list[i].rxq, >>>> - >>>> RXQ_CYCLES_PROC_CURR); >>>> - dp_netdev_rxq_set_intrvl_cycles(poll_list[i].rxq, >>>> rxq_cyc_curr); >>>> - dp_netdev_rxq_set_cycles(poll_list[i].rxq, >>> RXQ_CYCLES_PROC_CURR, >>>> - 0); >>>> + uint64_t rxq_cyc_curr; >>>> + struct dp_netdev_rxq *rxq; >>>> + >>>> + rxq = poll_list[i].rxq; >>>> + idx = rxq->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX; >>>> + >>>> + rxq_cyc_curr = dp_netdev_rxq_get_cycles(rxq, >>> RXQ_CYCLES_PROC_CURR); >>>> + dp_netdev_rxq_set_intrvl_cycles(rxq, rxq_cyc_curr, idx); >>>> + dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_PROC_CURR, 0); >>>> } >>>> + >>>> curr_tsc = cycles_counter_update(&pmd->perf_stats); >>>> if (pmd->intrvl_tsc_prev) { >>>> /* There is a prev timestamp, store a new intrvl cycle >>>> count. */ diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml >>>> index 2160910..ff3589c 100644 >>>> --- a/vswitchd/vswitch.xml >>>> +++ b/vswitchd/vswitch.xml >>>> @@ -574,6 +574,36 @@ >>>> be set to 'skip_sw'. >>>> </p> >>>> </column> >>>> + <column name="other_config" key="pmd-auto-lb" >>>> + type='{"type": "boolean"}'> >>>> + <p> >>>> + Configures PMD Auto Load Balancing that allows automatic >>> assignment of >>>> + RX queues to PMDs if any of PMDs is overloaded (i.e. processing >>> cycles >>>> + > 95%). >>>> + </p> >>>> + <p> >>>> + It uses current scheme of cycle based assignment of RX queues >>>> that >>>> + are not statically pinned to PMDs. >>>> + </p> >>>> + <p> >>>> + Set this value to <code>true</code> to enable this option. >>>> + </p> >>>> + <p> >>>> + The default value is <code>false</code>. >>>> + </p> >>>> + <p> >>>> + This only comes in effect if cycle based assignment is enabled >>>> and >>>> + there are more than one non-isolated PMDs present and atleast one >>> of >>>> + it polls more than one queue. >>>> + </p> >>>> + </column> >>>> + <column name="other_config" key="pmd-auto-lb-rebalance-intvl" >>>> + type='{"type": "integer", "minInteger": 1}'> >>>> + <p> >>>> + The minimum time (in minutes) 2 consecutive PMD Auto Load >>> Balancing >>>> + iterations. >>>> + </p> >>>> + </column> >>>> </group> >>>> <group title="Status"> >>>> <column name="next_cfg"> >>>> >> > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
