I should have the weighted round robin queue ready in the next few days.
I'm shaking out a few bugs from converting it over from my Hammer patch,
and I still need to write a test suite, but I can get you the branch
before then. I'd be interested to see what difference it makes, as that
would help decide whether this is a path worth continuing to pursue.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Wed, Nov 11, 2015 at 3:44 PM, Blinick, Stephen L wrote:
> Thanks for taking a look!
>
> First, the original slides are on the Ceph slideshare here:
> http://www.slideshare.net/Inktank_Ceph/accelerating-cassandra-workloads-on-ceph-with-allflash-pcie-ssds
> That should show the 1/2/4 partition comparison and the overall performance
> numbers, latency, and data set size. I didn't provide much context in this
> quick deck because I meant it to be a companion to that data.
>
> 1 -- I agree we're a long way from the max DC P3700 4K RR IOPS (460K), and
> that number is at QD128. In the future (some day) it would be nice if you
> could reach that performance just with OSD knobs to increase worker threads.
> We did try many shards/workers-per-shard combinations, but the real purpose
> of the presentation was mixed r/w performance and the "Cassandra" breakdown
> of IO sizes at a 50/50 mix. Every deviation from the default hurt write or
> mixed performance, so we were not just chasing the highest RandRead numbers
> possible.
>
> 2 -- We used CBT for all runs; unfortunately our default collectl config
> didn't grab NVMe stats, and we'll fix that as we move to Infernalis. I have
> NVMe bandwidth only (from Zabbix monitoring). Using iostat to spot-check the
> queue depth, though, it was always pretty low on the devices, at most 10 in
> the 4-partition 100% read case. For 100% RR @ QD32: 1 OSD/NVMe = 34.4K
> reads/device; 4 OSDs/NVMe = 55.5K reads/device (14K per OSD). The other
> presentation shows that with 2 OSDs/NVMe we hit 44K IOPS/device, and we did
> go to 8 OSDs/device but saw no improvement over 4. So we determined 4
> OSDs/NVMe is worth doing over 2, and a 4-OSDs-to-1-flash-device failure
> boundary closely matches the generally accepted ratio of OSD journals to
> one device.
>
> 3 -- Page cache effects should be negated in these numbers. As you can see
> in the other presentation, we did one run with 2TB of data, which showed
> higher performance (1.35M IOPS). But the rest of the tests were run with
> 4.8TB of data (replicated twice) and a uniform random distribution. While
> we did use 'norandommap' for client performance, the re-referencing of
> blocks in a dataset of that size should be low. Using iostat and Zabbix, we
> correlated the aggregate host read/write performance to the device
> statistics to confirm that object data wasn't coming out of cache.
>
> 4 -- Yeah, this multi-partitioning doesn't double or quadruple
> CPU/throughput and performance, but it gives significant improvements up to
> a point (4, by experimentation, in this cluster). For SSDs, our team in
> Shanghai used 2 partitions/SSD for best performance.
>
> 5 -- Good catch, I typed this in quickly when my mic wasn't working this
> morning :) In all cases the number of actual worker threads is double what
> is stated on slide #7 in the linked presentation below. This is because
> every shard defaults to 2 workers, and we did not change that in any of the
> published tests. When we did lower it to 1, it always hurt write
> performance as well.
>
> 6 -- As you can see in the config, we did increase filestore_op_threads to
> 6 (this gave a 15% boost in mixed r/w performance). Higher than that didn't
> help. I'm not sure if it would have helped in the case of 20 shards/OSD.
> The knobs in question are sketched below for reference.
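>
> The snippet below is just a sketch of the relevant [osd] settings showing
> the stock 5:2 shard/worker split and the filestore bump mentioned above;
> the values shown are the defaults plus that one change, not a verbatim copy
> of our test config (the 20-shard runs obviously raise osd_op_num_shards):
>
>   [osd]
>   osd_op_num_shards = 5              # default; raised to 20 in the 20-shard runs
>   osd_op_num_threads_per_shard = 2   # default; 5 shards x 2 = 10 workers per OSD
>   filestore_op_threads = 6           # default is 2; 6 gave the ~15% mixed r/w gain
>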
> I am still really curious about the scheduler behavior for threads within
> an OSD. Given the sleep/signal/wakeup mechanism between the msg dispatcher
> and the worker threads, is it possible that's causing the scheduler to bump
> threads up to a higher priority and somehow breaking fairness when there
> are more runnable threads than CPUs? Each node here has 72 CPUs (with HT)
> but, as you note, 160 worker threads (in addition to the pipe
> readers/writers and msg dispatchers).
>
> Thanks,
>
> Stephen
>
>
> From: Somnath Roy [mailto:its.somen...@gmail.com]
> Sent: Wednesday, November 11, 2015 3:02 PM
> To: Blinick, Stephen L
> Cc: ceph-devel@vger.kernel.org; Mark Nelson; Samuel Just; Kyle Bader; Somnath Roy
> Subject: Re: Increasing # Shards vs multi-OSDs per device
>
> Thanks for the data, Stephen. Some feedback:
>
> 1. I don't think a single OSD is there yet to serve 460K read IOPS,
> irrespective of how many shards/threads you are running. I didn't have your
> NVMe data earlier :-).. But for 50/60K SAS SSD IOPS, a single OSD per drive
> is probably good enough. I hope you tried increasing the shards/threads to
> very high values (since you have a lot of CPU left), say 40:2 or 80:1 (try
> one configuration with 1 thread/shard; it should reduce contention per
> shard)? Or even lower ratios like 10:2 or 20:1?
>
> 2. Do you have any data on disk utilization? It would be good to understand
> how much better single-disk utilization becomes when you are running
> multiple OSDs per drive (even a quick per-device spot check like the one
> sketched below would do). Back-calculating from your data, in the 4
> OSDs/drive case each OSD is serving ~14K read IOPS vs ~42K read IOPS with
> one OSD/drive. So this suggests that two OSDs/drive should be good enough
> to serve similar IOPS in your environment. You are able to extract ~56K
> IOPS per drive with 4 OSDs vs ~42K in the one-OSD case.
>
> 3. The above calculation discards all cache effects, but that's not
> realistic. You have a total of 128 GB * 5 = 640 GB of RAM. What is your
> total working set? If there is a lot of cache effect in this run, 4 OSDs
> (4 XFS filesystems) will benefit more than one OSD/drive. That would be an
> effect of the total number of OSDs in the cluster, not of the number of
> OSDs needed to saturate a drive.
>
> 4. Also, CPU-util-wise, you see only 20% more CPU utilization while running
> 4x more OSDs.
>
> 5. BTW, the worker thread calculation is incorrect: the default is 5:2, so
> each OSD is running 10 worker threads, and the total is 160 worker threads
> for both the 4 OSDs/drive and the 1 OSD/drive (20:2) cases.
>
> 6. The write data is surprising compared to the default-shard, 1-OSD case;
> maybe you need to increase filestore op threads since you have more data
> coming into the filestore?
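>
> For the disk utilization question in #2, even a quick per-device spot check
> during a run would be enough. Something along these lines (an illustrative
> iostat invocation, not a prescribed monitoring setup) next to the per-OSD
> client numbers would show whether the drive itself is the bottleneck:
>
>   # extended per-device stats, refreshed every second during the benchmark;
>   # watch r/s, w/s, avgqu-sz and %util for the nvme*n1 devices
>   iostat -x 1
>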
> Thanks & Regards
> Somnath
>
> On Wed, Nov 11, 2015 at 12:57 PM, Blinick, Stephen L wrote:
> Sorry about the microphone issues in the performance meeting today. This is
> a followup to the 11/4 performance meeting, where we discussed increasing
> the worker thread count in the OSDs vs making multiple OSDs (and
> partitions/filesystems) per device. We did the high-level experiment and
> have some results, which I threw into a ppt/pdf and shared here:
>
> http://www.docdroid.net/UbmvGnH/increasing-shards-vs-multiple-osds.pdf.html
>
> Going to 20-shard OSDs yielded about half of the random 4K read improvement
> that 4 OSDs per device with the default 5 shards gave. For writes,
> performance is actually worse than just 1 OSD per device with the default
> number of shards. The throttles should be large enough for the 20-shard
> case, as they are 10x the defaults, although if you see anything we missed,
> let us know.
>
> I had the cluster moved to the Infernalis release (with jemalloc) yesterday,
> so hopefully we'll have some early results on the same 5-node cluster soon.
>
> Thanks,
>
> Stephen