Thanks for taking a look!  

First, the original slides are on the Ceph slideshare here: 
http://www.slideshare.net/Inktank_Ceph/accelerating-cassandra-workloads-on-ceph-with-allflash-pcie-ssds
That deck shows the 1/2/4-partition comparison along with the overall performance numbers, latency, and data set size. I didn't provide much context in this quick deck because I meant it to be a companion to that data.

1 -- I agree we're a long way from the max DC P3700 4K random read IOPS (460K), and that number is measured at QD128. It would still be nice if, some day, you could reach that performance with OSD knobs that increase worker threads. We did try many shards / workers-per-shard combinations, but the real purpose of the presentation was mixed read/write performance and the "Cassandra" breakdown of IO sizes at a 50/50 mix. Every deviation from the defaults hurt write or mixed performance, so we were not simply chasing the highest random-read numbers possible.
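
For reference, these are the knobs in question -- a minimal ceph.conf sketch showing the stock values; the 20-shard experiment only raised osd_op_num_shards and left threads-per-shard alone:

    [osd]
    # defaults: 5 shards x 2 threads/shard = 10 op worker threads per OSD
    osd_op_num_shards = 5
    osd_op_num_threads_per_shard = 2
    # the 20-shard runs used osd_op_num_shards = 20 (threads/shard unchanged)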

2 -- We used CBT for all runs. Unfortunately our default collectl config didn't grab NVMe stats; we'll fix that as we move to Infernalis. I have NVMe bandwidth only (from Zabbix monitoring). Spot-checking queue depth with iostat, it was always pretty low on the devices, at most ~10 in the 4-partition 100% read case. For 100% random read at QD32: 1 OSD/NVMe = 34.4K reads/device; 4 OSD/NVMe = 55.5K reads/device (~14K per OSD). What's shown in the other presentation is that with 2 OSD/NVMe we hit 44K IOPS/device, and we did go to 8 OSDs/device but saw no improvement over 4.
So we determined that 4 OSDs/NVMe is worth doing over 2, and a failure boundary of 4 OSDs to 1 flash device closely matches the generally accepted ratio of OSD journals to one device.
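
Spelling that out per device and per OSD:

    1 OSD/NVMe:  34.4K reads/device  ->  34.4K per OSD
    2 OSD/NVMe:  44.0K reads/device  ->  22.0K per OSD
    4 OSD/NVMe:  55.5K reads/device  ->  ~13.9K per OSD
    8 OSD/NVMe:  no improvement over 4

i.e. roughly a 1.6x gain per device going from 1 to 4 OSDs, with clearly diminishing returns per OSD.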

3 -- Page cache effects should be negated in these numbers. As you can see in the other presentation, we did one run with 2TB of data, which showed higher performance (1.35M IOPS). But the rest of the tests were run with 4.8TB of data (replicated twice) and a uniform random distribution. While we did use 'norandommap' for client performance, the re-referencing of blocks in a dataset that size should be low. Using iostat and Zabbix, we correlated the hosts' aggregate read/write performance with the device statistics to confirm that object data wasn't coming out of cache.
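
For context, this is roughly the shape of the client workload -- an illustrative fio-style sketch only, not the actual CBT-generated job (the real runs used the Cassandra IO-size breakdown rather than a plain 4K block size, and the pool/image names here are hypothetical):

    [cassandra-like]
    # librbd engine, as driven by CBT
    ioengine=rbd
    clientname=admin
    pool=cbt-rbd
    rbdname=cbt-img-0
    rw=randrw
    rwmixread=50
    bs=4k
    iodepth=32
    # cheaper on the client, at the cost of possible block re-reference
    norandommap
    # uniform random over the full dataset
    random_distribution=random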

4 -- Agreed, this multi-partitioning doesn't double or quadruple CPU utilization or throughput. But it does give significant improvements up to a point (4 partitions, by experimentation, in this cluster). For SSDs, our team in Shanghai found 2 partitions/SSD gave the best performance.

5 -- Good catch; I typed this up quickly this morning when my mic wasn't working :)  In all cases the number of actual worker threads is double what is stated on slide #7 in the linked presentation below, because each shard defaults to 2 worker threads and we did not change that in any of the published tests. When we did lower it to 1, it always hurt write performance as well.
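
In other words:

    default:       5 shards x 2 threads/shard = 10 op worker threads per OSD
    20-shard OSD: 20 shards x 2 threads/shard = 40 op worker threads per OSD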

6 -- As you can see in the config, we did increase filestore_op_threads to 6 (this gave a 15% boost in mixed read/write performance). Going higher than that didn't help. I'm not sure whether it would have helped in the 20-shards/OSD case.
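
For completeness, the relevant line from our config (the stock default is 2):

    [osd]
    # ~15% better mixed r/w than the default of 2; going higher didn't help
    filestore_op_threads = 6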

I am still really curious about the scheduler behavior for threads within an OSD. Given the sleep/signal/wakeup mechanism between the messenger dispatcher and the worker threads, is it possible that's causing the scheduler to bump threads up to a higher priority and somehow breaking fairness when there are more runnable threads than CPUs? Each node here has 72 CPUs (with HT) but, as you note, 160 worker threads (in addition to the pipe readers/writers and messenger dispatchers).
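
A rough per-node count, just from the numbers above (4 NVMe per node, which is what the 160-thread total implies):

    4 OSD/NVMe:  4 devices x 4 OSDs x (5 shards x 2 threads)  = 160 op workers
    1 OSD/NVMe:  4 devices x 1 OSD  x (20 shards x 2 threads) = 160 op workers

Either way they're competing for 72 hyperthreads, before counting the messenger pipe readers/writers and dispatchers.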

Thanks,

Stephen




From: Somnath Roy [mailto:its.somen...@gmail.com] 
Sent: Wednesday, November 11, 2015 3:02 PM
To: Blinick, Stephen L
Cc: ceph-devel@vger.kernel.org; Mark Nelson; Samuel Just; Kyle Bader; Somnath 
Roy
Subject: Re: Increasing # Shards vs multi-OSDs per device

Thanks for the data Stephen. Some feedback:
 
1. I don't think a single OSD is anywhere near serving 460K read IOPS, no matter how many shards/threads you run. I didn't have your NVMe data earlier :-).. But for a 50-60K IOPS SAS SSD, a single OSD per drive is probably good enough. I hope you also tried increasing the shards/threads to very high values (since you have a lot of CPU left), say 40:2 or 80:1 (try one configuration with 1 thread/shard; it should reduce contention per shard), or even lower ratios like 10:2 or 20:1?
 
2. Do you have any data on disk utilization? It would be good to understand how much better single-disk utilization gets when you run multiple OSDs per drive. Back-calculating from your data: in the 4-OSDs/drive case each OSD is serving ~14K read IOPS, vs ~42K read IOPS with one OSD per drive. So this suggests that two OSDs/drive should be enough to serve similar IOPS in your environment. You are able to extract ~56K IOPS per drive with 4 OSDs vs 42K in the one-OSD case.
 
3. The above calculation discards all cache effects, but that's not realistic. You have a total of 128 GB * 5 = 640 GB of RAM. What is your total working set? If there is a lot of cache effect in this run, 4 OSDs (4 XFS filesystems) will benefit more from it than one OSD per drive. That could be an effect of the total number of OSDs in the cluster rather than of the number of OSDs needed to saturate a drive.
 
4. Also, CPU-utilization-wise, you see only about 20% more CPU util while running 4x more OSDs.
 
5. BTW, the worker thread calculation is incorrect: the default is 5:2, so each OSD is running 10 worker threads, for a total of 160 worker threads in both the 4-OSDs/drive and the 1-OSD/drive (20:2) configurations.
 
6. The write data is surprising compared to the default-shard, 1-OSD case; maybe you need to increase the filestore op threads since you have more data coming into the filestore?
 
Thanks & Regards
Somnath

On Wed, Nov 11, 2015 at 12:57 PM, Blinick, Stephen L 
<stephen.l.blin...@intel.com> wrote:
Sorry about the microphone issues in the performance meeting today. This is a follow-up to the 11/4 performance meeting where we discussed increasing the worker thread count in the OSDs vs. creating multiple OSDs (and partitions/filesystems) per device. We did the high-level experiment and have some results, which I threw into a ppt/pdf and shared here:

http://www.docdroid.net/UbmvGnH/increasing-shards-vs-multiple-osds.pdf.html

Doing 20-shard OSDs vs. 4 OSDs per device with the default 5 shards yielded about half of the performance improvement for random 4K reads. For writes, performance is actually worse than with just 1 OSD per device and the default number of shards. The throttles should be large enough for the 20-shard case, as they are 10x the defaults, although if you see anything we missed let us know.

I had the cluster moved to the Infernalis release (with jemalloc) yesterday, so hopefully we'll have some early results on the same 5-node cluster soon.

Thanks,

Stephen

