Hi David,

A few comments below,
thanks,
Kevin.

On 07/06/2021 12:48, David Marchand wrote:
> When setting PMD auto load balance parameters and observing the
> feature (and the per rxq statistics) in action, you can easily get
> rebalances while the sum of per rxq pmd usage is below the pmd load
> threshold you had set.
>

It is useful for ALB yes, but I think it is more a general improvement
than just that. The general feedback we got was along the lines of: if I
add all the rxq %'s on the PMD, it is not at 100%, but if I turn up the
traffic gen the throughput does not increase, is the calculation wrong?
etc.

> This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
> rxq cycles while some cycles are used in the pmd mainloop and adds up to
> the total pmd load.
>
> dpif-netdev/pmd-stats-show does report per pmd load usage.
> This load is measured since the last dpif-netdev/pmd-stats-clear call.
> On the other hand, the auto load balance feature uses the pmd load on a 10s
> sliding window which makes it non trivial to correlate.
>
> Gather per pmd busy cycles with the same periodicity and report the
> difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
> info in a single command.
>

I didn't check what is already in the docs, but it needs a few lines like
the above to say what 'overhead' is.

> Example:
> $ ovs-appctl dpif-netdev/pmd-rxq-show
> pmd thread numa_id 0 core_id 4:
>   isolated : false
>   port: dpdk0 queue-id: 0 (enabled) pmd usage: 37 %
>   port: dpdk1 queue-id: 0 (enabled) pmd usage: 36 %
>   port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost6 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost7 queue-id: 0 (enabled) pmd usage: 0 %
>   overhead: 4 %
> pmd thread numa_id 0 core_id 18:
>   isolated : false
>   port: vhost0 queue-id: 0 (enabled) pmd usage: 37 %
>   port: vhost1 queue-id: 0 (enabled) pmd usage: 39 %
>   port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost4 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost5 queue-id: 0 (enabled) pmd usage: 0 %
>   overhead: 5 %
>
> Signed-off-by: David Marchand <[email protected]>
> ---
> Documentation/topics/dpdk/pmd.rst | 6 ++
> lib/dpif-netdev.c | 100 ++++++++++++++++++++----------
> tests/pmd.at | 8 ++-
> 3 files changed, 80 insertions(+), 34 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst
> b/Documentation/topics/dpdk/pmd.rst
> index e481e79414..2df10dcc5d 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -213,6 +213,12 @@ load for each of its associated queues every 10 seconds.
> If the aggregated PMD
> load reaches the load threshold for 6 consecutive intervals then PMD
> considers
> itself to be overloaded.
>
> +.. versionchanged:: 2.16.0
> +
> + The per PMD load is shown in the same fashion than Rx queue's in the
> output
> + of ``pmd-rxq-show``. It accounts for all Rx queue's processing cycles and
> + internal PMD core main loop cost.
> +
> For example, to set the load threshold to 70%::
>
> $ ovs-vsctl set open_vswitch .\
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 650e67ab30..bcebcebe0e 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -235,11 +235,11 @@ struct dfc_cache {
>
> /* Time in microseconds of the interval in which rxq processing cycles used
> * in rxq to pmd assignments is measured and stored. */
> -#define PMD_RXQ_INTERVAL_LEN 10000000LL
> +#define PMD_INTERVAL_LEN 10000000LL
>
> /* Number of intervals for which cycles are stored
> * and used during rxq to pmd assignment. */
> -#define PMD_RXQ_INTERVAL_MAX 6
> +#define PMD_INTERVAL_MAX 6
>
> /* Time in microseconds to try RCU quiescing. */
> #define PMD_RCU_QUIESCE_INTERVAL 10000LL
> @@ -465,9 +465,9 @@ struct dp_netdev_rxq {
>
> /* Counters of cycles spent successfully polling and processing pkts. */
> atomic_ullong cycles[RXQ_N_CYCLES];
> - /* We store PMD_RXQ_INTERVAL_MAX intervals of data for an rxq and then
> + /* We store PMD_INTERVAL_MAX intervals of data for an rxq and then
> sum them to yield the cycles used for an rxq. */
> - atomic_ullong cycles_intrvl[PMD_RXQ_INTERVAL_MAX];
> + atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
> };
>
> /* A port in a netdev-based datapath. */
> @@ -703,13 +703,18 @@ struct dp_netdev_pmd_thread {
> long long int next_optimization;
> /* End of the next time interval for which processing cycles
> are stored for each polled rxq. */
> - long long int rxq_next_cycle_store;
> + long long int next_cycle_store;
>
> /* Last interval timestamp. */
> uint64_t intrvl_tsc_prev;
> /* Last interval cycles. */
> atomic_ullong intrvl_cycles;
>
> + /* Write index for 'busy_cycles_intrvl'. */
> + unsigned int intrvl_idx;
> + /* Busy cycles in last PMD_INTERVAL_MAX intervals. */
> + atomic_ullong busy_cycles_intrvl[PMD_INTERVAL_MAX];
> +
> /* Current context of the PMD thread. */
> struct dp_netdev_pmd_thread_ctx ctx;
>
> @@ -1229,6 +1234,8 @@ pmd_info_show_rxq(struct ds *reply, struct
> dp_netdev_pmd_thread *pmd)
> struct rxq_poll *list;
> size_t n_rxq;
> uint64_t total_cycles = 0;
> + uint64_t busy_cycles = 0;
> + uint64_t polling_cycles = 0;
>
> ds_put_format(reply,
> "pmd thread numa_id %d core_id %u:\n isolated : %s\n",
> @@ -1241,16 +1248,27 @@ pmd_info_show_rxq(struct ds *reply, struct
> dp_netdev_pmd_thread *pmd)
> /* Get the total pmd cycles for an interval. */
> atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> /* Estimate the cycles to cover all intervals. */
> - total_cycles *= PMD_RXQ_INTERVAL_MAX;
> + total_cycles *= PMD_INTERVAL_MAX;
> +
> + for (int j = 0; j < PMD_INTERVAL_MAX; j++) {
> + uint64_t cycles;
> +
> + atomic_read_relaxed(&pmd->busy_cycles_intrvl[j], &cycles);
> + busy_cycles += cycles;
> + }
> + if (busy_cycles > total_cycles) {
> + busy_cycles = total_cycles;
> + }
>
> for (int i = 0; i < n_rxq; i++) {
> struct dp_netdev_rxq *rxq = list[i].rxq;
> const char *name = netdev_rxq_get_name(rxq->rx);
> uint64_t proc_cycles = 0;
>
> - for (int j = 0; j < PMD_RXQ_INTERVAL_MAX; j++) {
> + for (int j = 0; j < PMD_INTERVAL_MAX; j++) {
> proc_cycles += dp_netdev_rxq_get_intrvl_cycles(rxq, j);
> }
> + polling_cycles += proc_cycles;

polling_cycles is a bit vague as it sounds like it could include polling
but not receiving packets. Also, there is now more of a mix of pmd and
rxq *_cycles in this fn, which can be a little confusing, so it might be
worthwhile to distinguish a bit more, e.g.

    for (int j = 0; j < PMD_INTERVAL_MAX; j++) {
        rxq_proc_cycles += dp_netdev_rxq_get_intrvl_cycles(rxq, j);
    }
    total_rxqs_proc_cycles += rxq_proc_cycles;

> ds_put_format(reply, " port: %-16s queue-id: %2d", name,
> netdev_rxq_get_queue_id(list[i].rxq->rx));
> ds_put_format(reply, " %s", netdev_rxq_enabled(list[i].rxq->rx)
> @@ -1265,6 +1283,19 @@ pmd_info_show_rxq(struct ds *reply, struct
> dp_netdev_pmd_thread *pmd)
> }
> ds_put_cstr(reply, "\n");
> }
> +
> + ds_put_cstr(reply, " overhead: ");

You could check n_rxq and only display overhead if it is > 0. Otherwise
overhead will stay as 'NOT AVAIL' for pmds with no rxqs.
e.g.:

# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 8:
  isolated : false
  port: urport queue-id: 0 (enabled) pmd usage: 95 %
  overhead: 4 %
pmd thread numa_id 1 core_id 9:
  isolated : false
  overhead: NOT AVAIL

> + if (total_cycles) {
> + if (polling_cycles > busy_cycles) {
> + polling_cycles = busy_cycles;
> + }
> + ds_put_format(reply, "%2"PRIu64" %%",
> + (busy_cycles - polling_cycles) * 100 /
> total_cycles);
> + } else {
> + ds_put_cstr(reply, "NOT AVAIL");
> + }
> + ds_put_cstr(reply, "\n");
> +
> ovs_mutex_unlock(&pmd->port_mutex);
> free(list);
> }
> @@ -4621,7 +4652,7 @@ static void
> dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx,
> unsigned long long cycles)
> {
> - unsigned int idx = rx->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX;
> + unsigned int idx = rx->intrvl_idx++ % PMD_INTERVAL_MAX;
> atomic_store_relaxed(&rx->cycles_intrvl[idx], cycles);
> }
>
> @@ -5090,7 +5121,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
> OVS_REQUIRES(dp->port_mutex)
>
> if (assign_cyc) {
> /* Sum the queue intervals and store the cycle history.
> */
> - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) {
> cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> }
> dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> @@ -5587,7 +5618,7 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t
> *core_list,
> }
>
> /* Sum the queue intervals and store the cycle history. */
> - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) {
> cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> }
> dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> @@ -5648,7 +5679,7 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t
> *core_list,
> /* Get the total pmd cycles for an interval. */
> atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> /* Estimate the cycles to cover all intervals. */
> - total_cycles *= PMD_RXQ_INTERVAL_MAX;
> + total_cycles *= PMD_INTERVAL_MAX;
> for (int id = 0; id < num_pmds; id++) {
> if (pmd->core_id == core_list[id]) {
> if (pmd_usage[id]) {
> @@ -5707,11 +5738,11 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> /* Get the total pmd cycles for an interval. */
> atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> /* Estimate the cycles to cover all intervals. */
> - total_cycles *= PMD_RXQ_INTERVAL_MAX;
> + total_cycles *= PMD_INTERVAL_MAX;
>
> ovs_mutex_lock(&pmd->port_mutex);
> HMAP_FOR_EACH (poll, node, &pmd->poll_list) {
> - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) {
> total_proc += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
> }
> }
> @@ -5820,7 +5851,7 @@ dpif_netdev_run(struct dpif *dpif)
> pmd_alb->rebalance_poll_timer = now;
> CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> if (atomic_count_get(&pmd->pmd_overloaded) >=
> - PMD_RXQ_INTERVAL_MAX) {
> + PMD_INTERVAL_MAX) {
> pmd_rebalance = true;
> break;
> }
> @@ -6030,6 +6061,7 @@ reload:
>
> pmd->intrvl_tsc_prev = 0;
> atomic_store_relaxed(&pmd->intrvl_cycles, 0);
> + pmd->intrvl_idx = 0;

Not sure it's a good idea resetting this every time the PMD is reloaded.
Maybe it can mean that newer busy_cycles_intrvl[] entries are overwritten
and older ones are kept longer, but anyway the next comment probably
supersedes this one.

I did some testing and saw that as busy_cycles_intrvl[] is not cleared on
reload, it can temporarily spike overhead if a busy rxq was moved off the
pmd.
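To illustrate what clearing it could look like, something along these
lines in the reload path (rough sketch only, reusing the fields the patch
adds, not necessarily the exact change I tried):

    /* Sketch: on reload, drop the busy-cycle history gathered before the
     * reconfiguration so stale intervals cannot show up as overhead. */
    pmd->intrvl_tsc_prev = 0;
    atomic_store_relaxed(&pmd->intrvl_cycles, 0);
    for (int i = 0; i < PMD_INTERVAL_MAX; i++) {
        atomic_store_relaxed(&pmd->busy_cycles_intrvl[i], 0);
    }
    pmd->intrvl_idx = 0;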
Even though everything will flush out (and stats won't all line up
initially aside from this), it looks a bit alarming to see a high
overhead. I tested resetting it here and it removed the spike.

> cycles_counter_update(s);
>
> pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL;
> @@ -6556,7 +6588,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread
> *pmd, struct dp_netdev *dp,
> pmd_thread_ctx_time_update(pmd);
> pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
> pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL;
> - pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
> + pmd->next_cycle_store = pmd->ctx.now + PMD_INTERVAL_LEN;
> hmap_init(&pmd->poll_list);
> hmap_init(&pmd->tx_ports);
> hmap_init(&pmd->tnl_port_cache);
> @@ -8777,31 +8809,33 @@ dp_netdev_pmd_try_optimize(struct
> dp_netdev_pmd_thread *pmd,
> uint64_t tot_idle = 0, tot_proc = 0;
> unsigned int pmd_load = 0;
>
> - if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
> + if (pmd->ctx.now > pmd->next_cycle_store) {
> uint64_t curr_tsc;
> uint8_t rebalance_load_trigger;
> struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
> - if (pmd_alb->is_enabled && !pmd->isolated
> - && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
> - pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
> - && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
> -
> pmd->prev_stats[PMD_CYCLES_ITER_BUSY]))
> - {
> + unsigned int idx;
> +
> + if (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
> + pmd->prev_stats[PMD_CYCLES_ITER_IDLE] &&
> + pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
> + pmd->prev_stats[PMD_CYCLES_ITER_BUSY]) {
> tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
> pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
>
> - if (tot_proc) {
> - pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> - }
> + if (pmd_alb->is_enabled && !pmd->isolated) {
> + if (tot_proc) {
> + pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> + }
>
> - atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
> - &rebalance_load_trigger);
> - if (pmd_load >= rebalance_load_trigger) {
> - atomic_count_inc(&pmd->pmd_overloaded);
> - } else {
> - atomic_count_set(&pmd->pmd_overloaded, 0);
> + atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
> + &rebalance_load_trigger);
> + if (pmd_load >= rebalance_load_trigger) {
> + atomic_count_inc(&pmd->pmd_overloaded);
> + } else {
> + atomic_count_set(&pmd->pmd_overloaded, 0);
> + }
> }
> }
>
> @@ -8824,9 +8858,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread
> *pmd,
> atomic_store_relaxed(&pmd->intrvl_cycles,
> curr_tsc - pmd->intrvl_tsc_prev);
> }
> + idx = pmd->intrvl_idx++ % PMD_INTERVAL_MAX;
> + atomic_store_relaxed(&pmd->busy_cycles_intrvl[idx], tot_proc);
> pmd->intrvl_tsc_prev = curr_tsc;
> /* Start new measuring interval */
> - pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
> + pmd->next_cycle_store = pmd->ctx.now + PMD_INTERVAL_LEN;
> }
>
> if (pmd->ctx.now > pmd->next_optimization) {
> diff --git a/tests/pmd.at b/tests/pmd.at
> index cc5371d5a5..256adb83f0 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -72,6 +72,7 @@ CHECK_PMD_THREADS_CREATED()
> AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN],
> [0], [dnl
> pmd thread numa_id <cleared> core_id <cleared>:
> isolated : false
> + usage: NOT AVAIL
> port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> ])
>
> @@ -103,6 +104,7 @@ dummy@ovs-dummy: hit:0 missed:0
> AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN],
> [0], [dnl
> pmd thread numa_id <cleared> core_id <cleared>:
> isolated : false
> + usage: NOT AVAIL
> port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
> @@ -134,6 +136,7 @@ dummy@ovs-dummy: hit:0 missed:0
> AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN],
> [0], [dnl
> pmd thread numa_id <cleared> core_id <cleared>:
> isolated : false
> + usage: NOT AVAIL
> port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
> @@ -149,13 +152,13 @@ TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
> AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
> CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
>
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ {
> printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/\<port:.*AVAIL$/ {
> printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
> port: p0 queue-id: 0 3 4 7
> port: p0 queue-id: 1 2 5 6
> ])
>
> AT_CHECK([ovs-vsctl set Open_vSwitch .
> other_config:pmd-rxq-assign=roundrobin])
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ {
> printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/\<port:.*AVAIL$/ {
> printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
> port: p0 queue-id: 0 2 4 6
> port: p0 queue-id: 1 3 5 7
> ])
> @@ -167,6 +170,7 @@ CHECK_PMD_THREADS_CREATED([1], [], [+$TMP])
> AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN],
> [0], [dnl
> pmd thread numa_id <cleared> core_id <cleared>:
> isolated : false
> + usage: NOT AVAIL
> port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
> port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
> port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
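One last thought, going back to the docs comment above: whatever wording
ends up in pmd.rst, it is worth spelling out that 'overhead' is simply
the PMD's busy cycles that are not attributed to any Rx queue, measured
over the same six 10 second intervals as the per-rxq 'pmd usage'. The way
I read the calculation in pmd_info_show_rxq(), it boils down to something
like this (hypothetical helper, names made up, just for illustration):

    static unsigned int
    pmd_overhead_pct(uint64_t busy_cycles, uint64_t rxqs_proc_cycles,
                     uint64_t total_cycles)
    {
        if (!total_cycles) {
            return 0;   /* The patch prints "NOT AVAIL" in this case. */
        }
        if (busy_cycles > total_cycles) {
            busy_cycles = total_cycles;
        }
        if (rxqs_proc_cycles > busy_cycles) {
            rxqs_proc_cycles = busy_cycles;
        }
        return (busy_cycles - rxqs_proc_cycles) * 100 / total_cycles;
    }

e.g. in the commit message example, core 4 has 37 % + 36 % of its cycles
attributed to rxqs and was busy for roughly 77 % of the window, which
leaves the reported 4 % as overhead.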
