On 04/12/2022 11:16, Cheng Li wrote:
Currently, pmd_rebalance_dry_run() calculates the overall variance of
all pmds regardless of their numa location. The overall result may
hide an imbalance within an individual numa.
Hi. Thanks, the idea looks like a good one. I haven't tested it yet; a
couple of comments on the code below.
Consider the following case. Numa 0 is idle because the VMs on numa 0
are not sending pkts, while numa 1 is busy. Within numa 1, the pmd
workloads are not balanced. Obviously, moving 500 kpps of workload from
pmd 126 to pmd 62 will make numa 1 much more balanced. For numa 1
the variance improvement will be almost 100%, because after the rebalance
each pmd in numa 1 holds the same workload (variance ~= 0). But the overall
variance improvement is only about 20%, which may not trigger auto_lb.
```
numa_id core_id kpps
0 30 0
0 31 0
0 94 0
0 95 0
1 126 1500
1 127 1000
1 63 1000
1 62 500
```
As auto_lb doesn't work if any cross_numa rxq exists, it means that
That's not fully true. It can run with cross-numa polling in very limited
circumstances, where there are only PMDs active on 1 numa.
It doesn't diminish the idea here, but it's best not to write that blanket
statement as it may confuse.
auto_lb only balances rxq assignment within each numa. So it makes
more sense to calculate the variance improvement per numa node.
Signed-off-by: Cheng Li <[email protected]>
---
lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
1 file changed, 47 insertions(+), 43 deletions(-)
I think at least some docs would need to be updated to say it's variance
per NUMA.
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71..6a53f13 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
static uint64_t variance(uint64_t a[], int n);
static uint64_t
-sched_numa_list_variance(struct sched_numa_list *numa_list)
+sched_numa_variance(struct sched_numa *numa)
{
- struct sched_numa *numa;
uint64_t *percent_busy = NULL;
- unsigned total_pmds = 0;
int n_proc = 0;
uint64_t var;
- HMAP_FOR_EACH (numa, node, &numa_list->numas) {
- total_pmds += numa->n_pmds;
- percent_busy = xrealloc(percent_busy,
- total_pmds * sizeof *percent_busy);
+ percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);
- for (unsigned i = 0; i < numa->n_pmds; i++) {
- struct sched_pmd *sched_pmd;
- uint64_t total_cycles = 0;
+ for (unsigned i = 0; i < numa->n_pmds; i++) {
+ struct sched_pmd *sched_pmd;
+ uint64_t total_cycles = 0;
- sched_pmd = &numa->pmds[i];
- /* Exclude isolated PMDs from variance calculations. */
- if (sched_pmd->isolated == true) {
- continue;
- }
- /* Get the total pmd cycles for an interval. */
- atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
+ sched_pmd = &numa->pmds[i];
+ /* Exclude isolated PMDs from variance calculations. */
+ if (sched_pmd->isolated == true) {
+ continue;
+ }
+ /* Get the total pmd cycles for an interval. */
+ atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
- if (total_cycles) {
- /* Estimate the cycles to cover all intervals. */
- total_cycles *= PMD_INTERVAL_MAX;
- percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
- / total_cycles;
- } else {
- percent_busy[n_proc++] = 0;
- }
+ if (total_cycles) {
+ /* Estimate the cycles to cover all intervals. */
+ total_cycles *= PMD_INTERVAL_MAX;
+ percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
+ / total_cycles;
+ } else {
+ percent_busy[n_proc++] = 0;
}
}
var = variance(percent_busy, n_proc);
@@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
struct sched_numa_list numa_list_est;
bool thresh_met = false;
uint64_t current_var, estimate_var;
+ struct sched_numa *numa_cur, *numa_est;
uint64_t improvement = 0;
VLOG_DBG("PMD auto load balance performing dry run.");
@@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
sched_numa_list_count(&numa_list_est) == 1) {
/* Calculate variances. */
- current_var = sched_numa_list_variance(&numa_list_cur);
- estimate_var = sched_numa_list_variance(&numa_list_est);
-
- if (estimate_var < current_var) {
- improvement = ((current_var - estimate_var) * 100) / current_var;
- }
- VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
- current_var, estimate_var);
- VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
-
- if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
- thresh_met = true;
- VLOG_DBG("PMD load variance improvement threshold %u%% "
- "is met.", dp->pmd_alb.rebalance_improve_thresh);
- } else {
- VLOG_DBG("PMD load variance improvement threshold "
- "%u%% is not met.",
- dp->pmd_alb.rebalance_improve_thresh);
+ HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
+ numa_est = sched_numa_list_lookup(&numa_list_est,
+ numa_cur->numa_id);
+ if (!numa_est) {
+ continue;
+ }
+ current_var = sched_numa_variance(numa_cur);
+ estimate_var = sched_numa_variance(numa_est);
+ if (estimate_var < current_var) {
+ improvement = ((current_var - estimate_var) * 100)
+ / current_var;
+ }
+ VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current"
+ " variance %"PRIu64" Estimated variance %"PRIu64".",
+ numa_cur->numa_id, improvement,
+ current_var, estimate_var);
Not sure of the reason for changing the logging order? Only the NUMA node
needs to be added.
+
+ if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
+ VLOG_DBG("PMD load variance improvement threshold %u%% "
+ "is met.", dp->pmd_alb.rebalance_improve_thresh);
+ thresh_met = true;
+ break;
I think it's better to remove the break here and move the "result"
logging outside the for loop. Just set thresh_met = true at this
point. Then the logging looks like:
"PMD auto load balance dry run"
"Numa node x. Current y...Estimate z...Improve n"
"Numa node x. Current y...Estimate z...Improve n"
"PMD load improvement of m is met." or "...not met"
For debug it would be important to continue the run on the other NUMA
even if the threshold is already met, so removing the break; would allow
this.
+ } else {
+ VLOG_DBG("PMD load variance improvement threshold "
+ "%u%% is not met.",
+ dp->pmd_alb.rebalance_improve_thresh);
+ }
}
} else {
VLOG_DBG("PMD auto load balance detected cross-numa polling with "