Hi!
We got some strange performance results when running a random-read fio test on
our test Hammer cluster.
When we run fio-rbd (4k, randread, 8 jobs, QD=32, 500 GB RBD image) for the
first time, with a cold/empty page cache, we get ~12k IOPS sustained. That is
quite a reasonable value, as 12k IOPS / 34 OSDs ≈ 352 IOPS per disk, which is
normal for a 10k SAS disk. Since most of the data really has to be read from
the platters, we also see high iowait (~45%) and moderate user CPU activity
(~35%).
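For reference, the fio invocation we use is essentially the following; the
client, pool and image names below are placeholders for our actual ones:

    # 4k random read against a 500 GB RBD image, 8 jobs, queue depth 32
    fio --name=rbd-randread --ioengine=rbd \
        --clientname=admin --pool=rbd --rbdname=test-500g \
        --rw=randread --bs=4k --numjobs=8 --iodepth=32 \
        --direct=1 --group_reporting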
But when we run the same test a second time, some of the data is already in
the page cache and can be accessed faster, and indeed we get ~25k IOPS. Now we
see low iowait (~1-3%), but surprisingly high user CPU activity, >70%.
perf top shows us that most calls are in the tcmalloc library:
  19,61%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
  15,53%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_Next(void*)
   9,03%  libtcmalloc.so.4.2.2  [.] TCMalloc_PageMap3<35>::get(unsigned long) const
   6,71%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
   1,59%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
   1,58%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_PopRange(void**, int, void**, void**)
   1,42%  libtcmalloc.so.4.2.2  [.] tcmalloc::PageHeap::GetDescriptor(unsigned long) const
   1,03%  libtcmalloc.so.4.2.2  [.] 0x0000000000060589
   0,91%  libtcmalloc.so.4.2.2  [.] tcmalloc::ThreadCache::Scavenge()
   0,82%  libtcmalloc.so.4.2.2  [.] tcmalloc::DLL_Remove(tcmalloc::Span*)
   0,80%  libtcmalloc.so.4.2.2  [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
   0,75%  libtcmalloc.so.4.2.2  [.] tcmalloc::Static::pageheap()
   0,69%  libtcmalloc.so.4.2.2  [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long) const
   0,51%  libpthread-2.19.so    [.] __pthread_mutex_unlock_usercnt
Running the same test over an RBD image in the SSD pool gives the same
25-30k IOPS, even though each DC S3700 SSD we used in that pool easily
delivers >50k IOPS on its own. I think the 25-30k IOPS ceiling we hit is due
to tcmalloc inefficiency.
What can we do to improve these results? Is there some tcmalloc tuning we
should apply, or would building Ceph with jemalloc give better results? Do you
have any thoughts?
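The kinds of changes we have in mind are roughly the following; the cache size
and the library path are our guesses for Debian Jessie, not something we have
verified on this cluster yet:

    # Option 1: enlarge the tcmalloc thread cache before starting the OSDs
    # (gperftools default is 32 MB; 128 MB below is an arbitrary guess)
    export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

    # Option 2: preload jemalloc instead of rebuilding Ceph with it
    # (path from Debian's libjemalloc1 package; may differ on other systems)
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1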
Our small test Hammer install:
- Debian Jessie;
- Ceph Hammer 0.94.2, self-built from sources (with tcmalloc);
- 1x E5-2670 + 128 GB RAM;
- 2 nodes, shared with the mons; system and mon DB on a separate SAS mirror;
- 17 OSDs per node, SAS 10k;
- 2x Intel DC S3700 200 GB SSD for journaling on each node;
- 2x Intel DC S3700 400 GB SSD for a separate SSD pool;
- 10 Gbit interconnect, shared public and cluster network, MTU 9100;
- 10 Gbit client host, fio 2.2.7 compiled with the RBD engine.
Megov Igor
CIO, Yuterra