On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
> We are not interested in actual target if both prev
> and curr cpus share CPU cache. select_idle_sibling()
> searches in top-down order; top level is the same
> for both of them, and the result will be the same.
> So, we can save a little CPU cycles and cache misses
> and skip wake_affine() calculations.

But, whereas previously wake_affine() could NAK a migration if it would
create an imbalance, we'll now just go ahead and stack tasks if
select_idle_sibling() can't find an idle home to override the blanket
approval.  It doesn't look like a good idea to me to bounce tasks around
only to then perhaps stack them, as if we do stack waker/wakee, we
certainly lose concurrency. (microbenchmarks like pipe-test love that,
but not all that many real applications play ping-pong for a living;)

I spent most of the day piddling with your little patch, so I'll post
some condensed mixed load notes.

concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!
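
For reading the tables: avg looks like the mean of the three runs with the
fraction dropped, and comp like the master+ average divided by the master
average. A minimal sketch of that arithmetic, assuming that is how the
columns were produced (avg_tps() and comp_ratio() are made-up helper names):

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n runs, truncated to whole tps (assumed from the avg column). */
static long avg_tps(const long *runs, size_t n)
{
	long sum = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	return sum / (long)n;
}

/* Patched average over baseline average; below 1.0 means the patch lost. */
static double comp_ratio(long patched_avg, long base_avg)
{
	assert(base_avg > 0);
	return (double)patched_avg / (double)base_avg;
}
```

Fed the "clients 1" row, avg_tps() gives 18541 and 17617, and comp_ratio()
comes out at roughly .950.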

Do the opposite, wake_affine() always NAKs.
                                             master                           master++
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

...

virgin vs your patch again, 2 _minutes_ per client count, as I noticed much
variance at 8 clients, where wake_wide() is supposed to kick in to keep N:M
load spread out.

                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960

30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job 
for 1:N pgbench.
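
For anyone without fair.c open, here is a userspace model of the wake_wide()
test, written from memory of mainline of about this vintage, so treat it as
an assumption rather than a quote of the source. The function and parameter
names are stand-ins: the kernel takes the two counts from the tasks'
wakee_flips and the factor from the per-cpu sd_llc_size.

```c
#include <assert.h>

/*
 * Userspace model of the wake_wide() test. Returns 1 when the wakeup
 * should be spread wide (no affine pull), 0 when waker and wakee may
 * be pulled together.
 *
 * waker_flips/wakee_flips stand in for the tasks' wakee_flips counters,
 * llc_size for sd_llc_size. All three names are illustrative.
 */
static int wake_wide_model(unsigned int waker_flips,
			   unsigned int wakee_flips,
			   unsigned int llc_size)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	assert(llc_size > 0);

	/* Whichever side switches partners more often is the "master". */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	/*
	 * Slave flips below the factor, or master not at least factor
	 * times the slave: not a wide pattern, stay affine.
	 */
	if (slave < llc_size || master < slave * llc_size)
		return 0;

	return 1;
}
```

With made-up counts of 40 flips on one side and 6 on the other, the model
says affine for a factor of 8 and wide for a factor of 4, which is the kind
of difference a threads-vs-cores factor can make.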

hrmph, twiddle...

waker/wakee coupling strengthened
postgres@homer:~> pgbench.sh
clients 1       tps = 18035
clients 2       tps = 32525
clients 4       tps = 53246
clients 8       tps = 37278

better, but not enough..  + sd_llc_size = #cores vs #threads
postgres@homer:~> pgbench.sh
clients 1       tps = 18482
clients 2       tps = 32366
clients 4       tps = 54557
clients 8       tps = 69643

Ok, that's what I want to see, full repeat.
master = twiddle
master+ = twiddle+patch

concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966

Hrm... turn the tables, measure tbench while pgbench 4 client load runs 
endlessly.

                                             master                           master+
tbench                    1       2       3     avg         1       2       3     avg   comp
pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123

tbench without competition
                     master master+    comp
pairs 1        MB/s =   694     692    .997
pairs 2        MB/s =  1268    1259    .992
pairs 4        MB/s =  2210    2165    .979
pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

twiddle:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
 {
        struct sched_domain *sd;
        struct sched_domain *busy_sd = NULL;
+       struct sched_group *group;
        int id = cpu;
        int size = 1;
 
        sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
        if (sd) {
                id = cpumask_first(sched_domain_span(sd));
-               size = cpumask_weight(sched_domain_span(sd));
                busy_sd = sd->parent; /* sd_busy */
+               group = sd->groups;
+               /* Set size to the number of cores, not threads */
+               while (group = group->next, group != sd->groups)
+                       size++;
        }
        rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
 
 static void record_wakee(struct task_struct *p)
 {
+       unsigned long now = jiffies;
+
        /*
         * Rough decay (wiping) for cost saving, don't worry
         * about the boundary, really active task won't care
         * about the loss.
         */
-       if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+       if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
                current->wakee_flips >>= 1;
-               current->wakee_flip_decay_ts = jiffies;
+               current->wakee_flip_decay_ts = now;
+       }
+       if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
+               p->wakee_flips >>= 1;
+               p->wakee_flip_decay_ts = now;
        }
 
        if (current->last_wakee != p) {
                current->last_wakee = p;
                current->wakee_flips++;
+               p->wakee_flips++;
        }
 }
 

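The core.c hunk leans on sd->groups being a circular list. Assuming there is
one group per core at the SD_SHARE_PKG_RESOURCES level on an SMT box, one lap
around the ring counts cores instead of threads. A standalone sketch of that
walk, with struct group as a made-up stand-in for struct sched_group:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct sched_group: only the circular ->next link. */
struct group {
	struct group *next;
};

/*
 * Count the nodes on a circular singly linked list. The count starts
 * at 1 for the first node and is bumped once per additional node,
 * the same shape as the loop in the hunk above.
 */
static int count_groups(struct group *first)
{
	struct group *g = first;
	int size = 1;

	assert(first != NULL);
	while (g = g->next, g != first)
		size++;
	return size;
}
```

A ring of four groups counts as 4, and a lone group pointing at itself
counts as 1, which matches the initial size = 1.
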

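The fair.c hunk can be modeled in userspace the same way. This is a sketch
under stated assumptions, not the kernel code: struct task, MODEL_HZ and the
function names are invented for illustration, and a plain comparison replaces
time_after(), so jiffies wraparound is ignored.

```c
#include <assert.h>

#define MODEL_HZ 250UL	/* illustrative tick rate, not taken from the mail */

/* Stand-in for the task_struct fields record_wakee() touches. */
struct task {
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
	struct task *last_wakee;
};

/* Halve the flip count once more than a second of ticks has passed. */
static void decay_flips(struct task *t, unsigned long now)
{
	if (now > t->wakee_flip_decay_ts + MODEL_HZ) {
		t->wakee_flips >>= 1;
		t->wakee_flip_decay_ts = now;
	}
}

/*
 * Model of the patched record_wakee(): both sides decay, and a change
 * of partner is charged to the wakee as well as to the waker.
 */
static void record_wakee_model(struct task *waker, struct task *wakee,
			       unsigned long now)
{
	assert(waker && wakee);

	decay_flips(waker, now);
	decay_flips(wakee, now);

	if (waker->last_wakee != wakee) {
		waker->last_wakee = wakee;
		waker->wakee_flips++;
		wakee->wakee_flips++;
	}
}
```

In a 1:N load the N wakees now pick up flips whenever the 1 switches partner,
so the smaller count in a wake_wide()-style comparison no longer sits near
zero, which is presumably the point of the coupling.
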