On Wed, Dec 3, 2014 at 12:41 PM, Chaitanya Huilgol <[email protected]> wrote:
> Hi All,
>
> We are seeing large read performance variations across RBD clients on different pools. Below is a summary of our findings:
>
> - The first client starting I/O after a cluster restart (ceph start/stop on all OSD nodes) gets the best performance.
> - Clients started later exhibit 40% to 70% degraded performance. This is seen even in cases where the first client's I/O is stopped before starting the second client's I/O.
> - Adding performance counters showed a large increase in latency across the entire path, with no specific point of increased latency - up to a 3x increase.
> - On further investigation we have root-caused this to degradation in tcmalloc performance, which induces large latency across the entire path.
> - The variation also grows as we increase the number of op worker shards; with fewer shards the variation is smaller, but that results in more lock contention and is not a good option for SSD-based clusters.
> - The variation is observed even when the RBD images have not been written to at all, indicating that this is not a filesystem issue.
>
> Below is a snippet of perf top output for the two runs:
>
> (1) TCmalloc - Client-1
>   2.68%  ceph-osd              [.] crush_hash32_3
>   2.65%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
>   1.66%  [kernel]              [k] _raw_spin_lock
>   1.56%  libstdc++.so.6.0.19   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>   1.51%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
>
> (2) TCmalloc - Client-2 (note the significant increase in tcmalloc's internal free-to-central-list code paths)
>  14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
>   7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>   6.71%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>   1.68%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
>   1.57%  ceph-osd              [.] crush_hash32_3
>
> Tying it all together, it looks like new client I/O on a different pool changes how the OSD shards are used, which induces movement of memory between the thread-local caches and the central free lists. Increasing the tcmalloc thread cache limit with 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issue in our test setups. However, this is only a temporary resolution - it also bloats the OSD memory usage.
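For reference, the limit mentioned above can also be inspected and raised at runtime through gperftools' MallocExtension API - a minimal sketch, assuming a libtcmalloc build that exposes the "tcmalloc.max_total_thread_cache_bytes" numeric property (the runtime counterpart of the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable); the 128 MB value below is purely illustrative, not a recommendation from the report:

    // build with: g++ tc_cache.cc -ltcmalloc
    #include <gperftools/malloc_extension.h>
    #include <cstdio>

    int main() {
      size_t limit = 0;
      // Read the current ceiling on the combined size of all thread-local caches.
      if (MallocExtension::instance()->GetNumericProperty(
              "tcmalloc.max_total_thread_cache_bytes", &limit))
        printf("max_total_thread_cache_bytes: %zu\n", limit);

      // Raise the ceiling at runtime - equivalent to exporting
      // TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 before process start.
      MallocExtension::instance()->SetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", 128UL << 20);
      return 0;
    }

If the binary is not actually running under tcmalloc, both property calls should simply return false.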
I've noticed that tcmalloc is quite visible in perf top, but I never looked closer because we don't even have debug symbols enabled in our tcmalloc. Here's a production dumpling ceph-osd right now:

Samples: 35K of event 'cycles', Event count (approx.): 4040795974, Thread: ceph-osd(13976)
 87.81%  libtcmalloc.so.4.1.0.#prelink#.P1wCcj  [.] 0x0000000000017e6f
  1.41%  libpthread-2.12.so                     [.] pthread_mutex_lock
  1.40%  libstdc++.so.6.0.13                    [.] 0x0000000000065b8c

What value did you use for TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, and how well did it alleviate the problem? I assume

  env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=x ceph-osd ...

is sufficient to override this?

Cheers, Dan
