On 02/07/2018 12:42 AM, Peter Zijlstra wrote:
On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:

I meant the SMT balance patch. That does comparison with only one other
random core and takes the decision in O(1). Any potential scan of all cores
or cpus is O(n) and doesn't scale and will only get worse in future. That
applies to both select_idle_core() and select_idle_cpu().
We only do the full scan if we think to know there is indeed an idle
core to be had, and if there are idle cores the machine isn't terribly
busy.

If there are no idle cores, we do not in fact scan everything. We limit
the amount of scanning based on the average idle time with a minimum of
4.
This logic may not be working as well as you may think it to be. I had
sent the cost of select_idle_sibling() w/ and w/o my patch and there was
huge difference:

Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:

function                 baseline-rc6  %stdev patch               %stdev
select_idle_sibling()    0.556         1.72    0.263 (-52.70%)     0.78

(Except when you switch off things like SIS_PROP, then you scan
unconditionally and reduce tail latency at the cost of throughput --
like said, some people want this).

O(1) sounds nice, but it has horrible worst case latencies.

And like I said, I don't see how a rotor is particularly more random
than the previous cpu the task you happen to be waking ran on.
The rotor randomness is not the main point here. I think the benchmark
improvements come from the fact that select_idle_sibling() cost has reduced
a lot. To reduce it while still maintaining good spread of threads can be
achieved by this SMT balance scheme which in turn requires a fast decent
random number generator and rotor is just an easy way to achieve that.
Is there any reason this randomized approach is not acceptable even if
benchmarks show improvement? Are there other benchmarks I should try?
Looking at just one other core has a fairly high chance of not finding
idle. I really cannot remember all the benchmarks, but Mike did
something a little less random but still O(1) a few years back and that
crashed and burned.

Also your suggestion to keep the SMT utilization but still do a traversal of
cores
in select_idle_core() while remembering the least loaded core will still
have
the problem of potentially traversing all cores. I can compare this with a
core
level only SMT balancing, is that useful to decide? I will also test on
SPARC
machines with higher degree of SMT.
Please learn to use your email client, that's near unreadable for
whitespace damage, time is really too precious to try and untangle crap
like that.
Sorry about that

You had also mentioned to do it for only SMT >2, not sure I understand why
as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
the scalability problem.
For SMT2 you don't need the occupation counter with atomic crud, a
simple !atomic core-idle works just fine. And your 'patch' had soo many
moving parts it was too hard to say which aspect changed what.

hackbench really isn't _that_ interesting a benchmark and its fairly
easy to make it go faster, but doing so typically wrecks something else.
I also had uperf and Oracle DB TPC-C numbers in the patch which showed
improvements

There's a whole array of netperf benchmarks which you have to run at
different utilization levels, there's the sysbench oltp stuff, which you
can run on MariaDB and PostgreSQL, again at different utilization levels
(both databases behave quite differently). There's ebizzy, tbench,
dbench and others.

There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.

There's the facebook schbench, which is a bit finnicky to set up.

And I'm sure I'm forgetting half, look at the patches and regression
reports for more clues, that's what Google is for. You could have
figured all that out yourself, much of these are listed in the
changelogs and if you'd bothered to find lkml regression reports you
could find more.
Ok, will find out and run them.

Thanks,
Subhra

Reply via email to