Hi Kristian,

On Tue, Apr 29, 2014 at 12:44:22PM +0200, Kristian Nielsen wrote:
> At the Barcelona meeting in January, I promised to take a look at the
> high-concurrency sysbench OLTP benchmarks, and now I finally had the time
> to do this.

Thanks for looking at it!
> There was a lot of work on LOCK_open by Svoj and Serg. If I have understood
> correctly, the basic problem was that at high concurrency (like, 512 threads),
> the TPS is only a small fraction of the peak throughput at lower concurrency.
> Basically, the server "falls over" and starts thrashing instead of doing real
> work, due to some kind of inter-processor communication overhead.

There are quite a few issues around scalability. The one that I was attempting
to solve is this: MariaDB generates intensive bus traffic when its threads run
on different NUMA nodes. I suppose even 2 threads running on different nodes
will be affected. It happens due to writes to shared memory locations;
mutexes performing spin-locks in particular seem to generate a lot of bus
traffic.

The subsystems that mostly affect scalability are:
1. THR_LOCK - per-share
2. table cache - now mostly per-share
3. InnoDB

> I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr
> revno:4151 (revid:[email protected]). I
> compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.
>
> (I just realised that my runs are with 32 tables, while I think the benchmarks
> in January focused on single-table runs. Maybe I need to re-do my analysis
> with the single-table benchmark, or perhaps it is too artificial to matter
> much?).

Yes, the benchmark was focused on single-table runs. Starting with 10.0.10 we
eliminated LOCK_open in favor of a per-share mutex. This means the scalability
issues should remain for single-table runs, but should be solved for
multi-table runs.

> In the read-only sysbench, the server mostly does not fall over. I guess this
> is due to the work by Svoj on eliminating LOCK_open?

Likely. I would gladly interpret benchmark results if there are any. :)

Since I haven't analyzed InnoDB internals with respect to scalability yet, I'd
better stay away from commenting on the rest of the e-mail.

Thanks,
Sergey

> But in read-write, performance drops dramatically at high concurrency. TPS
> drops to 2600 at 512 threads compared to a peak of around 13000 (numbers here
> are approximate only, they vary somewhat between different runs).
>
> So I analysed the r/w benchmark with the Linux `perf` tool. It turns out
> two-thirds of the time is spent in a single kernel function _raw_spin_lock():
>
>   - 66.26%  mysqld  [kernel.kallsyms]  [k] _raw_spin_lock
>
> Digging further using --call-graph, this turns out to be mostly futex waits
> (and futex wakeups) from inside InnoDB locking primitives. Calls like
> sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in
> particular.
>
> So this is related to the non-scalable implementation of locking primitives
> in InnoDB, which is a known problem. I think Mark Callaghan has written about
> it a couple of times. Last time I looked at the code, every single mutex wait
> has to take a global mutex protecting some global arrays and stuff. I even
> remember seeing code that at mutex release would pthread_signal_broadcast()
> _every_ waiter, all of them waking up, only to all (except one) go do another
> wait. This is a killer for scalability.
>
> While investigating, I discovered the variable innodb_sync_array_size, which
> I did not know about. It seems to split the mutex for some of the
> synchronisation operations. So I tried to re-run the benchmark with
> innodb_sync_array_size set to 8 and 64. In both cases I got a significant
> improvement: TPS increased to 5900, twice the value with
> innodb_sync_array_size set to the default of 1.
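(A side note on why splitting the sync array helps, in case it is useful: the
general pattern is lock striping - instead of one global mutex protecting a
single wait array, the array is partitioned into N independent instances and
each waiter is hashed to one of them, so a thread only contends with the other
waiters that landed in the same instance. Below is a minimal C sketch of that
idea only; the names wait_array_t, slot_for() and reserve_wait_cell() are made
up for illustration, and this is not the actual InnoDB sync array code.)

    #include <pthread.h>
    #include <stdint.h>

    /* Partition the "sync array" into N independent instances, each with
       its own mutex, instead of one global mutex serializing all waiters.
       N_SYNC_ARRAYS plays the role of innodb_sync_array_size here.       */
    #define N_SYNC_ARRAYS 8

    typedef struct {
        pthread_mutex_t mutex;   /* protects only this instance's cells   */
        /* ... wait cells for this instance would live here ...           */
    } wait_array_t;

    static wait_array_t sync_arrays[N_SYNC_ARRAYS];

    static void sync_arrays_init(void)
    {
        for (int i = 0; i < N_SYNC_ARRAYS; i++)
            pthread_mutex_init(&sync_arrays[i].mutex, NULL);
    }

    /* Pick an instance for a given lock: a cheap hash of its address.    */
    static wait_array_t *slot_for(const void *lock_addr)
    {
        uintptr_t h = (uintptr_t) lock_addr;
        h ^= h >> 12;            /* mix the bits a little                 */
        return &sync_arrays[h % N_SYNC_ARRAYS];
    }

    /* A waiter now only contends with waiters that hashed to the same
       instance, not with every waiting thread in the whole server.       */
    static void reserve_wait_cell(const void *lock_addr)
    {
        wait_array_t *wa = slot_for(lock_addr);
        pthread_mutex_lock(&wa->mutex);
        /* ... register this thread as waiting on lock_addr ...           */
        pthread_mutex_unlock(&wa->mutex);
    }

With one instance, every mutex wait in the server funnels through the same
mutex; with 8 or 64 instances the contention is split correspondingly, which
is presumably why raising the setting from the default of 1 helped here.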
> So it is clear that the main limitation in this benchmark was the non-scalable
> InnoDB synchronisation implementation. After tuning innodb_sync_array_size,
> time spent in _raw_spin_lock() is down to half what it was before (33% of
> total time):
>
>   + 33.77%  mysqld  [kernel.kallsyms]  [k] _raw_spin_lock
>
> Now, investigating the call-graphs shows that the sync_array operations are
> much less visible. Instead mutex_create_func(), called from
> dict_mem_table_create(), is the one that turns up prominently in the profile.
> I am not familiar with what this part of the InnoDB code is doing, but what I
> saw from a quick look is that it creates a mutex - and there is another global
> mutex needed for this, which again limits scalability.
>
> It is a bit surprising to see mutex creation being the most significant
> bottleneck in the benchmark. I would have assumed that most mutexes could be
> created up-front and re-used? It is possible that this is a warm-up thing;
> maybe the code is filling up the buffer pool or some table-cache-like thing
> inside InnoDB? Because I see TPS being rather low for the first 150 seconds of
> the run (around 3000), and then increasing suddenly to around 8000-9000 for
> the rest. This might be worth investigating further.
>
> So in summary, my investigations found that the bottleneck in this benchmark,
> and the likely cause of the fall-over, is a scalability problem with InnoDB
> locking primitives. The sync_array part seems to be mitigated to some degree
> by innodb_sync_array_size; the mutex creation part still needs to be
> investigated.
>
> I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does
> anyone know? I vaguely recall reading something about it, but I am not sure.
> It would seem a waste to duplicate their efforts.
>
> In any case, I hope this was useful. As part of this investigation, I
> installed a new 3.14 kernel on the lizard2 machine and a new `perf`
> installation, which seems to work well for doing more detailed investigations
> of these kinds of issues. So let me know if there are other benchmarks that I
> should look into. One thing that could be interesting is to look for false
> sharing; the Intel manuals describe some performance counters that can be
> used for this.
>
> As an aside: in my tests, once concurrency becomes high enough that the server
> falls over, the actual TPS number becomes mostly meaningless. E.g. I saw that
> putting dummy pause loops into the code increased TPS. If TPS stabilises at
> N% of peak throughput as concurrency goes to infinity, then we can compare
> N. But if N goes to zero as concurrency goes to infinity, I think it is
> meaningless to compare actual TPS numbers - we should instead focus on
> removing the fall-over behaviour.
>
> (Maybe this is already obvious to you; I have not followed the previous
> benchmark efforts that closely.)
>
> Hope this helps,
>
>  - Kristian.
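PS: on the false-sharing idea at the end of the mail - the effect to hunt for
is two logically independent hot variables sharing one 64-byte cache line, so
that writes from different cores keep invalidating each other's copy of the
line even though the threads never touch the same data. That is the same kind
of coherence/bus traffic as in the spin-lock case discussed above. Below is a
minimal standalone C sketch of the pattern (made-up struct names, not taken
from the server code); running it under something like
`perf stat -e cache-misses,cache-references` with and without the padding
should show the difference.

    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define N_ITER 100000000UL

    /* Two per-thread counters that share one cache line: every increment
       from one core invalidates the line in the other core's cache, even
       though the threads never read or write each other's counter.       */
    struct shared_counters {
        unsigned long a;
        unsigned long b;
    } hot;

    /* The usual fix: give each hot counter its own cache line. Switching
       'hot' to this layout should make the coherence traffic (and the
       corresponding cache-miss counters) drop sharply.                   */
    struct padded_counters {
        unsigned long a __attribute__((aligned(CACHE_LINE)));
        unsigned long b __attribute__((aligned(CACHE_LINE)));
    };

    static void *bump_a(void *arg)
    {
        (void) arg;
        for (unsigned long i = 0; i < N_ITER; i++)
            hot.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void) arg;
        for (unsigned long i = 0; i < N_ITER; i++)
            hot.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%lu %lu\n", hot.a, hot.b);
        return 0;
    }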

