On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:

> Looks like a NAK may be better, because it saves the L1 cache, while the patch
> always invalidates it.

Yeah, bounce hurts more when there's no concurrency win waiting to be
collected.  This mixed load wasn't a great choice, but it turned out to
be pretty interesting.  Something waking a gaggle of waiters on a busy
big socket should do very bad things.

> Could you say whether you execute pgbench using just -cX -jY -T30 or something
> special? I've tried it, but the dispersion of the results varies a lot from
> run to run.

pgbench -T $testtime -j 1 -S -c $clients
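Run-to-run dispersion can be tamed a bit by averaging several short runs. A minimal wrapper along these lines would do it; the run count, runtime, and the tps parsing are illustrative assumptions, not the harness actually used above:

```shell
#!/bin/sh
# Average several pgbench runs to smooth out run-to-run dispersion.
# Run count, runtime, and the tps parsing below are illustrative
# assumptions, not the harness actually used in this thread.

testtime=${testtime:-120}

# avg_tps RUNS CLIENTS: run pgbench RUNS times, print the mean tps.
avg_tps() {
    runs=$1 clients=$2 total=0 i=0
    while [ "$i" -lt "$runs" ]; do
        tps=$(pgbench -T "$testtime" -j 1 -S -c "$clients" |
              awk '/^tps/ { print int($3); exit }')
        total=$((total + tps))
        i=$((i + 1))
    done
    echo $((total / runs))
}
```

Something like `avg_tps 3 4` then reports one averaged number per client count instead of three scattered ones.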

> > Ok, that's what I want to see, full repeat.
> > master = twiddle
> > master+ = twiddle+patch
> > 
> > concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
> >                                master                          master+
> > pgbench               1      2      3    avg        1      2      3    avg   comp
> > clients 1   tps = 18599  18627  18532  18586    17480  17682  17606  17589   .946
> > clients 2   tps = 32344  32313  32408  32355    25167  26140  23730  25012   .773
> > clients 4   tps = 52593  51390  51095  51692    22983  23046  22427  22818   .441
> > clients 8   tps = 70354  69583  70107  70014    66924  66672  69310  67635   .966
> > 
> > Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
> > 
> >                                master                          master+
> > tbench                1      2      3    avg        1      2      3    avg   comp
> > pairs 1    MB/s =   430    426    436    430      481    481    494    485  1.127
> > pairs 2    MB/s =  1083   1085   1072   1080     1086   1090   1083   1086  1.005
> > pairs 4    MB/s =  1725   1697   1729   1717     2023   2002   2006   2010  1.170
> > pairs 8    MB/s =  2740   2631   2700   2690     3016   2977   3071   3021  1.123
> > 
> > tbench without competition
> >                master        master+   comp
> > pairs 1        MB/s =   694     692    .997 
> > pairs 2        MB/s =  1268    1259    .992
> > pairs 4        MB/s =  2210    2165    .979
> > pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)
> 
> Hm, it seems tbench with competition is better only because a busy system
> makes tbench processes be woken on the same cpu.

Yeah.  When the box is really full, select_idle_sibling() (obviously) turns
into a waste of cycles, but even as you approach that point, especially when
filling the box with identical copies of nearly fully synchronous high
frequency localhost packet blasters, stacking is a win.

What bent my head up a bit was the combined effect of making wake_wide()
really keep pgbench from collapsing, then adding the affine wakeup grant
for tbench.  It's not at all clear to me why the 2 and 4 client cases
would be so demolished.
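For context, the wakee-flip test at the heart of wake_wide() can be modeled as a toy calculation. The sketch below mirrors the shape of that heuristic (a wakeup goes "wide" only when both waker and wakee flip wakees often relative to LLC size); the flip counts and the LLC-size factor are illustrative numbers, not the in-tree code:

```shell
# Toy model of the wakee-flip test: returns (echoes) 1 when the wakeup
# should go "wide" and skip the affine fast path, 0 to stay affine.
# Arguments: waker flip count, wakee flip count, LLC size factor.
wake_wide() {
    master=$1 slave=$2 factor=$3
    if [ "$master" -lt "$slave" ]; then   # order so master >= slave
        tmp=$master; master=$slave; slave=$tmp
    fi
    # Few flips, or counts not lopsided enough: stay affine.
    if [ "$slave" -lt "$factor" ] || [ "$master" -lt $((slave * factor)) ]; then
        echo 0
    else
        echo 1
    fi
}

wake_wide 2 1 8      # tbench-like 1:1 pair -> 0 (stays affine)
wake_wide 640 20 8   # pgbench-like 1:N server -> 1 (goes wide)
```

A 1:1 pair like a tbench couple never accumulates flips, so it keeps the affine wakeup; a 1:N server like the pgbench postmaster flips wakees constantly, so its wakeups go wide.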

        -Mike

