Thanks Mark, I have added this item to the agenda for today's meeting.

Regards,
Chaitanya
-----Original Message-----
From: Mark Nelson [mailto:[email protected]]
Sent: Wednesday, December 03, 2014 7:51 PM
To: Chaitanya Huilgol; [email protected]
Subject: Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue

On 12/03/2014 05:41 AM, Chaitanya Huilgol wrote:
> Hi All,
>
> We are seeing large read performance variations across RBD clients on
> different pools. Below is a summary of our findings:
>
> - The first client to start I/O after a cluster restart (ceph start/stop
>   on all OSD nodes) gets the best performance.
> - Clients started later exhibit 40% to 70% degraded performance. This is
>   seen even when the first client's I/O is stopped before the second
>   client's I/O is started.
> - Performance counters showed a large (up to 3x) increase in latency
>   across the entire path, with no single point of increased latency.
> - On further investigation we have root-caused this to degradation in
>   tcmalloc performance, which adds latency across the entire path.
> - The variation grows with the number of op worker shards; fewer shards
>   reduce the variation but cause more lock contention, which is not a
>   good option for SSD-based clusters.
> - The variation is observed even when the RBD images have not been
>   written at all, indicating that this is not a filesystem issue.
>
> Below is a snippet of perf top output for the two runs:
>
> (1) TCmalloc - Client-1
>
>   2.68%  ceph-osd              [.] crush_hash32_3
>   2.65%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
>   1.66%  [kernel]              [k] _raw_spin_lock
>   1.56%  libstdc++.so.6.0.19   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>   1.51%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
>
> (2) TCmalloc - Client-2 (note the significant increase in tcmalloc's
>     internal free-to-central-list code paths)
>
>  14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
>   7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>   6.71%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>   1.68%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
>   1.57%  ceph-osd              [.] crush_hash32_3
>
> Tying it all together, it looks like new client I/O on a different pool
> changes how the OSD shards are used, which forces memory to move between
> the thread-local caches and the central free lists.
> Increasing the tcmalloc thread cache limit with
> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES alleviates the issue in our test
> setups. However, this is only a temporary workaround and it also bloats
> OSD memory usage.
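>
> For reference, a minimal sketch of bumping the same limit from inside the
> process rather than via the environment variable (this assumes gperftools'
> MallocExtension interface; the helper name and the 128 MB value are
> illustrative, not tuned recommendations):
>
>   #include <cstddef>
>   #include <gperftools/malloc_extension.h>
>
>   // Raise tcmalloc's aggregate thread-cache budget; this is the same knob
>   // exposed by the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable.
>   static void bump_tcmalloc_thread_cache(size_t bytes) {
>     MallocExtension::instance()->SetNumericProperty(
>         "tcmalloc.max_total_thread_cache_bytes", bytes);
>   }
>
>   int main() {
>     bump_tcmalloc_thread_cache(128UL << 20);  // 128 MB, illustrative only
>     return 0;
>   }
>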
> We have also tested glibc malloc and jemalloc based builds, where this
> issue is not seen; both hold up well. Below is the perf output from
> those tests.
>
> (3) Glibc malloc - any client - no significant change
>
>   3.00%  libc-2.19.so         [.] _int_malloc
>   2.65%  libc-2.19.so         [.] malloc
>   2.47%  libc-2.19.so         [.] _int_free
>   2.33%  ceph-osd             [.] crush_hash32_3
>   1.63%  [kernel]             [k] _raw_spin_lock
>   1.38%  libstdc++.so.6.0.19  [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>
> (4) Jemalloc - any client - no significant change
>
>   2.47%  ceph-osd             [.] crush_hash32_3
>   2.25%  libjemalloc.so.1     [.] free
>   2.07%  libc-2.19.so         [.] 0x0000000000081070
>   1.95%  libjemalloc.so.1     [.] malloc
>   1.65%  [kernel]             [k] _raw_spin_lock
>   1.60%  libstdc++.so.6.0.19  [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>
> IMHO, we should probably look at the following for better performance
> with less variation:
>
> - Add a jemalloc option for Ceph builds.
> - Look at ways to distribute PGs evenly across the shards - with a larger
>   number of shards, some shards are not exercised at all while others are
>   overloaded.
> - Look at decreasing heap activity in the I/O path (IndexManager,
>   HashIndex, LFNIndex, etc.).
>
> We can discuss this further in today's performance meeting.

This is a fantastic writeup Chaitanya. Please add it to the performance
meeting agenda.

fwiw there are some interesting benchmarks and discussion of different
allocators here:

http://www.percona.com/blog/2012/07/05/impact-of-memory-allocators-on-mysql-performance/
http://www.reddit.com/r/programming/comments/18zija/github_got_30_better_performance_using_tcmalloc/

I would definitely be in favor of at least exploring options other than
tcmalloc.

Mark

> Thanks,
> Chaitanya
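To make the shard-distribution point above concrete, the standalone sketch
below (not Ceph code; the shard count and the hash-modulo mapping of PG ids
onto op-worker shards are assumptions for illustration) hashes a small set of
hot PGs onto a larger set of shards and prints how unevenly they land:

  #include <cstdint>
  #include <iostream>
  #include <vector>

  int main() {
    const uint32_t num_shards  = 25;   // assumed op-worker shard count
    const uint32_t pool        = 3;    // a single busy pool
    const uint32_t num_hot_pgs = 16;   // PGs actually receiving client I/O

    std::vector<uint32_t> pgs_per_shard(num_shards, 0);
    for (uint32_t ps = 0; ps < num_hot_pgs; ++ps) {
      // assumed shard selection: mix the (pool, pg) key and take it mod
      // the shard count; Ceph's real mapping may differ
      uint64_t key   = (static_cast<uint64_t>(pool) << 32) | ps;
      uint64_t mixed = key * 0x9e3779b97f4a7c15ull;
      uint32_t shard = static_cast<uint32_t>(mixed >> 32) % num_shards;
      ++pgs_per_shard[shard];
    }

    // With 16 hot PGs and 25 shards, at least nine shards receive no PGs at
    // all, and any hash collisions stack several PGs onto a single shard.
    for (uint32_t s = 0; s < num_shards; ++s)
      std::cout << "shard " << s << ": " << pgs_per_shard[s] << " PG(s)\n";
    return 0;
  }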
