Re: [ceph-users] fio test rbd - single thread - qd1
> `cpupower idle-set -D 0` will help you a lot, yes.
>
> However it seems that it is not only bluestore that makes it slow. >= 50%
> of the latency is introduced by the OSD itself. [...] It's roughly 1 ms
> of latency, and only 50% of it comes from bluestore (you can see it with
> `ceph daemon osd.x perf dump`)! [...] But the overall improvement is
> still only 20% - everything else is eaten by the OSD itself.

Thanks for the insight - that means the SSD numbers for read/write
performance are roughly ok, I guess.

It still puzzles me why the bluestore caching does not benefit the read
side. Is the cache not an LRU cache over the block device, or is it
actually used for something else?

Jesper
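One way to check whether those reads ever hit the BlueStore cache is to
watch the cache counters in the same perf dump Vitaliy mentions (his
message is quoted in full further down the thread). A minimal sketch,
assuming osd.0 is local and jq is installed - the exact onode/buffer
counter names vary a little between releases:

# Dump the bluestore counter section and keep only the cache-related
# lines. If the buffer/onode "hit" counters stay flat while a randread
# run is active, the reads are going to the device, not the cache.
ceph daemon osd.0 perf dump | jq '.bluestore' | grep -Ei 'onode|buffer'

Sampling this before and after a fio run gives the hit/miss delta for
just that workload.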
Re: [ceph-users] fio test rbd - single thread - qd1
On 3/20/19 3:12 AM, Vitaliy Filippov wrote:
> [...] >= 50% of the latency is introduced by the OSD itself. I'm just
> trying to understand WHAT parts of it are doing so much work. [...]
> But the overall improvement is still only 20% - everything else is
> eaten by the OSD itself.

I'd suggest looking in the direction of pglog. See:

https://www.spinics.net/lists/ceph-devel/msg38975.html

Back around that time I hacked the pglog updates out of the code when I
was testing a custom version of the memstore backend and saw some pretty
dramatic reductions in CPU usage (and at least some increase in
performance). Unfortunately I think fixing it is going to be a big job,
but it's high on my list of troublemakers.

Mark
Re: [ceph-users] fio test rbd - single thread - qd1
`cpupower idle-set -D 0` will help you a lot, yes.

However it seems that it is not only bluestore that makes it slow. >= 50%
of the latency is introduced by the OSD itself. I'm just trying to
understand WHAT parts of it are doing so much work. For example, in my
current case (with cpupower idle-set -D 0, of course), when I was testing
a single OSD on a very good drive (Intel NVMe, capable of 4+ single-thread
sync write iops) it was delivering only 950-1000 iops. That's roughly 1 ms
of latency, and only 50% of it comes from bluestore (you can see it with
`ceph daemon osd.x perf dump`)! I've even tuned bluestore a little, so
that I'm now getting ~1200 iops from it. That means bluestore's share of
the latency dropped by about 33%: it was around half of the 1/1000 s total,
i.e. ~500 us, and now it is ~330 us of the 1/1200 s = ~830 us total. But
the overall improvement is still only 20% - everything else is eaten by
the OSD itself.

--
With best regards,
Vitaliy Filippov
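For reference, the bluestore-vs-OSD latency split can be pulled straight
out of the admin socket. A minimal sketch, assuming osd.0 is local, jq is
available, some writes have already happened (avgcount > 0), and that the
counter names (osd/op_w_latency, bluestore/commit_lat) match your release:

# Compare whole-OSD write latency with bluestore commit latency.
# Both counters dump as {avgcount, sum}; average seconds = sum/avgcount.
ceph daemon osd.0 perf dump > /tmp/perf.json
jq '.osd.op_w_latency | .sum / .avgcount' /tmp/perf.json    # end-to-end
jq '.bluestore.commit_lat | .sum / .avgcount' /tmp/perf.json  # store only

The gap between the two numbers is roughly the "everything else eaten by
the OSD itself" described above.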
Re: [ceph-users] fio test rbd - single thread - qd1
On 19/03/2019 16:17, jes...@krogh.cc wrote:
> Hi All.
>
> I'm trying to get head and tails into where we can stretch our Ceph
> cluster into what applications. Parallelism works excellently, but
> baseline throughput is - perhaps - not what I would expect it to be.
> [...]
> Any thoughts - suggestions - insights ?

Cannot comment on the cache issue, hope someone will. The SSD read
latency of 0.5 ms and write latency of 2 ms are in the ballpark; with
Bluestore it is difficult to get below 1 ms for a write.

As suggested, make sure your CPU has at most 1 c-state and the p-state
minimum frequency is 100%. Also, a CPU with a higher clock speed would
give a better qd=1 latency than a CPU with more cores but lower GHz.

Maged
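A sketch of how those two settings could be applied on the OSD nodes,
assuming the cpupower utility is installed and, for the min_perf_pct line,
the intel_pstate driver:

# P-states: pin the frequency governor to performance.
cpupower frequency-set -g performance
# C-states: disable every idle state with exit latency above 0 us.
cpupower idle-set -D 0
# With intel_pstate, the minimum frequency can also be pinned to 100%.
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
# Verify what actually took effect.
cpupower frequency-info
cpupower idle-info

Note these settings do not survive a reboot, so they need to go into a
boot-time unit or rc script to be permanent.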
Re: [ceph-users] fio test rbd - single thread - qd1
> One thing you can check is the CPU performance (the cpu governor in
> particular). On such light loads I've seen CPUs sitting in a low
> performance mode (slower clocks), giving MUCH worse performance results
> than with heavier loads. Try "cpupower monitor" on the OSD nodes in a
> loop and observe the core frequencies.

Thanks for the suggestion. They all seem to be fully powered up - other
suggestions/reflections are truly welcome.

Thanks.

Jesper
Re: [ceph-users] fio test rbd - single thread - qd1
One thing you can check is the CPU performance (the cpu governor in
particular). On such light loads I've seen CPUs sitting in a low
performance mode (slower clocks), giving MUCH worse performance results
than with heavier loads. Try "cpupower monitor" on the OSD nodes in a
loop and observe the core frequencies.

On 2019-03-19 3:17 p.m., jes...@krogh.cc wrote:
> Hi All.
>
> I'm trying to get head and tails into where we can stretch our Ceph
> cluster into what applications. Parallelism works excellently, but
> baseline throughput is - perhaps - not what I would expect it to be.
> [...]
> Any thoughts - suggestions - insights ?

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
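For the loop, something along these lines works - a sketch assuming
cpupower is installed; the /proc/cpuinfo line is a fallback that works
everywhere:

# Sample C-state residency and core clocks once a second while the
# benchmark runs on the client (stop with Ctrl-C).
while true; do
    cpupower monitor
    # Fallback: count how many cores sit at each frequency.
    grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c
    sleep 1
done

If the clocks only ramp up under heavier parallel load, the governor is
likely the culprit for the qd=1 numbers.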
[ceph-users] fio test rbd - single thread - qd1
Hi All.

I'm trying to get head and tails into where we can stretch our Ceph
cluster into what applications. Parallelism works excellently, but
baseline throughput is - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD daemons have 16GB of cache.
Fio files attached - 4KB random read and 4KB random write - the test file
is "only" 1GB. In this I ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated: one backed with SSDs, S4510's
(14x1TB), and one with HDDs (84x10TB).

Network latency from the rbd mount to one of the osd-hosts:

--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:

randr:
# grep iops read*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
      N        Min        Max     Median        Avg     Stddev
x    38    1727.07    2033.66    1954.71  1949.4789  46.592401

randw:
# grep iops write*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
      N        Min        Max     Median        Avg     Stddev
x    36     400.05     455.26     436.58  433.91417  12.468187

The double (or triple) network penalty of course kicks in here and
delivers a lower throughput. Are these performance numbers in the
ballpark of what we'd expect?

With a 1GB test file I would really expect this to be memory cached in
the OSD/bluestore cache and thus deliver a read IOPS closer to the
theoretical max: 1 s / 0.108 ms => ~9.2K IOPS.

Again on the write side - all OSDs are backed by battery-backed write
cache, so writes should go directly into the memory of the controller ...
still slower than reads, due to having to visit 3 hosts, but not this
low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similarly - surprisingly - slow numbers:

# grep iops write*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
      N        Min        Max     Median        Avg     Stddev
x    38      36.91      118.8      69.14  72.926842   21.75198

This should have the same performance characteristics as the SSDs, as
the writes should be hitting the BBWC.

# grep iops read*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
      N        Min        Max     Median        Avg     Stddev
x    39      26.18     181.51      48.16  50.574872   24.01572

Same here - should be cached in the bluestore cache, as it is 16GB x 84
OSDs with a 1GB test file.

Any thoughts - suggestions - insights ?

Jesper

Attachments: fio-single-thread-randr.ini, fio-single-thread-randw.ini
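(The two attached .ini files are not reproduced in the archive. For
anyone wanting to repeat the test, a roughly equivalent qd=1 job can be
expressed on the fio command line - a sketch only, not the attached
configuration: the pool and image names are placeholders, and librbd
support is assumed in the fio build:)

# 4KB random read, single job, queue depth 1, against an rbd image;
# swap --rw=randwrite for the write variant.
fio --name=randr --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=testimg --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --size=1g --runtime=60 --time_based \
    --output-format=json --output=read-1.json

The json files this produces are what the grep/ministat pipeline above
is parsing.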