Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread jesper
> `cpupower idle-set -D 0` will help you a lot, yes.
>
> However, it seems that it is not only bluestore that makes it slow: >= 50%
> of the latency is introduced by the OSD itself. I'm just trying to understand
> WHAT parts of it are doing so much work. For example, in my current case
> (with cpupower idle-set -D 0, of course), when I was testing a single OSD on
> a very good drive (an Intel NVMe capable of 4+ single-thread sync write
> iops), it was delivering only 950-1000 iops. That is roughly 1 ms of latency,
> and only 50% of it comes from bluestore (you can see it in `ceph daemon osd.x
> perf dump`)! I've even tuned bluestore a little, so now I'm getting ~1200
> iops from it. That means bluestore's latency dropped by about 33% (it was
> roughly half of the 1/1000 s = ~500 us; after tuning the total is 1/1200 s =
> ~833 us, of which bluestore accounts for ~330 us). But the overall
> improvement is still only 20% - everything else is eaten by the OSD itself.


Thanks for the insight - that means the SSD numbers for read/write
performance are roughly OK, I guess.

It still puzzles me why the bluestore caching does not benefit
the read side.

Is the cache not an LRU cache over the block device, or is it actually used
for something else?
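
For reference, this is roughly how I check how the cache is currently sized
and split on one of the OSD nodes (just a sketch - osd.0 is an example id,
and the kv/meta ratio options decide how much of the cache is left over for
object data, which is what a cached 4 KB random read would have to hit):

# ceph daemon osd.0 config show | grep -E 'bluestore_cache_(size|kv_ratio|meta_ratio)'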

Jesper



Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Mark Nelson

On 3/20/19 3:12 AM, Vitaliy Filippov wrote:

`cpupower idle-set -D 0` will help you a lot, yes.

However, it seems that it is not only bluestore that makes it slow: >= 50% of
the latency is introduced by the OSD itself. I'm just trying to understand
WHAT parts of it are doing so much work. For example, in my current case
(with cpupower idle-set -D 0, of course), when I was testing a single OSD on
a very good drive (an Intel NVMe capable of 4+ single-thread sync write
iops), it was delivering only 950-1000 iops. That is roughly 1 ms of latency,
and only 50% of it comes from bluestore (you can see it in `ceph daemon osd.x
perf dump`)! I've even tuned bluestore a little, so now I'm getting ~1200
iops from it. That means bluestore's latency dropped by about 33% (it was
roughly half of the 1/1000 s = ~500 us; after tuning the total is 1/1200 s =
~833 us, of which bluestore accounts for ~330 us). But the overall
improvement is still only 20% - everything else is eaten by the OSD itself.




I'd suggest looking in the direction of pglog.  See:


https://www.spinics.net/lists/ceph-devel/msg38975.html


Back around that time I hacked pglog updates out of the code when I was
testing a custom version of the memstore backend and saw some pretty
dramatic reductions in CPU usage (and at least some increase in
performance).  Unfortunately I think fixing it is going to be a big job,
but it's high on my list of troublemakers.
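
If anyone wants to see for themselves where ceph-osd burns its cycles under
a qd=1 load, something along these lines works (a sketch - it assumes the
kernel perf tool and debug symbols for ceph-osd are installed, and profiles
one OSD process for 30 seconds while the fio job is running):

# perf record -g -p $(pidof -s ceph-osd) -- sleep 30
# perf report --stdio | head -50

If pglog is a big contributor, its functions should show up near the top of
the call graph.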



Mark




Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Vitaliy Filippov

`cpupower idle-set -D 0` will help you a lot, yes.

However, it seems that it is not only bluestore that makes it slow: >= 50% of
the latency is introduced by the OSD itself. I'm just trying to understand
WHAT parts of it are doing so much work. For example, in my current case
(with cpupower idle-set -D 0, of course), when I was testing a single OSD on
a very good drive (an Intel NVMe capable of 4+ single-thread sync write
iops), it was delivering only 950-1000 iops. That is roughly 1 ms of latency,
and only 50% of it comes from bluestore (you can see it in `ceph daemon osd.x
perf dump`)! I've even tuned bluestore a little, so now I'm getting ~1200
iops from it. That means bluestore's latency dropped by about 33% (it was
roughly half of the 1/1000 s = ~500 us; after tuning the total is 1/1200 s =
~833 us, of which bluestore accounts for ~330 us). But the overall
improvement is still only 20% - everything else is eaten by the OSD itself.
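
For reference, this is roughly how I compare the whole-OSD write latency
with the bluestore part (a sketch - the exact counter names vary between
releases, so check your own perf dump output first; jq is assumed to be
installed):

# ceph daemon osd.x perf dump | jq '{osd_op_w: .osd.op_w_latency, bluestore_commit: .bluestore.commit_lat}'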


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Maged Mokhtar




On 19/03/2019 16:17, jes...@krogh.cc wrote:

Hi All.

I'm trying to make heads or tails of where we can stretch our Ceph cluster
and for which applications. Parallelism works excellently, but baseline
throughput is - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD daemons have 16 GB of cache.

Fio files attached - 4 KB random read and 4 KB random write - the test file
is "only" 1 GB.
In this I ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated .. one backed by SSDs (14x 1 TB S4510s)
and one by HDDs (84x 10 TB).

Network latency from the rbd mount to one of the OSD hosts:
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  38       1727.07       2033.66       1954.71     1949.4789     46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  36        400.05        455.26        436.58     433.91417     12.468187

The double (or triple) network penalty of course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With a 1 GB test file .. I would really expect this to be memory-cached in
the OSD/bluestore cache and thus deliver read IOPS closer to the theoretical
max: 1 s / 0.108 ms => ~9.2K IOPS.

Again on the write side - all OSDs are backed by battery-backed write cache,
so writes should go directly into the memory of the controller .. still
slower than reads - due to having to visit 3 hosts .. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  38         36.91         118.8         69.14     72.926842      21.75198

This should have the same performance characteristics as the SSDs, as the
writes should be hitting the BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  39         26.18        181.51         48.16     50.574872      24.01572

Same here - should be cached in the bluestore cache, as it is 16 GB x 84
OSDs .. with a 1 GB test file.

Any thoughts - suggestions - insights ?

Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



I cannot comment on the cache issue; hope someone will.

The SSD read latency of 0.5 ms and write latency of 2 ms are in the
ballpark. With Bluestore it is difficult to get below 1 ms for writes.


As suggested, make sure your CPU allows at most C-state 1 and that the
P-state minimum frequency is 100%. Also, a CPU with a higher clock (GHz)
would give a better qd=1 latency than a CPU with more cores but a lower
clock.
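
Something along these lines on every OSD node is a reasonable starting point
(a sketch - cpupower ships with the kernel tools package on most distros):

# cpupower idle-set -D 0
# cpupower frequency-set -g performance
# cpupower frequency-info | grep 'current CPU frequency'

The last command is just a quick check that the cores actually sit at (or
near) their maximum frequency.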


Maged


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread jesper
> One thing you can check is the CPU performance (cpu governor in
> particular).
> On such light loads I've seen CPUs sitting in low performance mode (slower
> clocks), giving MUCH worse performance results than when tried with
> heavier
> loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core
> frequencies.
>

Thanks for the suggestion. They all seem to be powered up .. other
suggestions/reflections are truly welcome. Thanks.

Jesper



Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread Piotr Dałek
One thing you can check is the CPU performance (cpu governor in particular). 
On such light loads I've seen CPUs sitting in low performance mode (slower 
clocks), giving MUCH worse performance results than when tried with heavier 
loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core 
frequencies.
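
For example (a sketch - run it on each OSD node while the fio job is
active):

# while true; do cpupower monitor; sleep 1; done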


On 2019-03-19 3:17 p.m., jes...@krogh.cc wrote:

Hi All.

I'm trying to make heads or tails of where we can stretch our Ceph cluster
and for which applications. Parallelism works excellently, but baseline
throughput is - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD daemons have 16 GB of cache.

Fio files attached - 4 KB random read and 4 KB random write - the test file
is "only" 1 GB.
In this I ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated .. one backed by SSDs (14x 1 TB S4510s)
and one by HDDs (84x 10 TB).

Network latency from the rbd mount to one of the OSD hosts:
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  38       1727.07       2033.66       1954.71     1949.4789     46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  36        400.05        455.26        436.58     433.91417     12.468187

The double (or triple) network penalty of course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With a 1 GB test file .. I would really expect this to be memory-cached in
the OSD/bluestore cache and thus deliver read IOPS closer to the theoretical
max: 1 s / 0.108 ms => ~9.2K IOPS.

Again on the write side - all OSDs are backed by battery-backed write cache,
so writes should go directly into the memory of the controller .. still
slower than reads - due to having to visit 3 hosts .. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  38         36.91         118.8         69.14     72.926842      21.75198

This should have the same performance characteristics as the SSDs, as the
writes should be hitting the BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  39         26.18        181.51         48.16     50.574872      24.01572

Same here - should be cached in the bluestore cache, as it is 16 GB x 84
OSDs .. with a 1 GB test file.

Any thoughts - suggestions - insights ?

Jesper



--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com


[ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread jesper
Hi All.

I'm trying to make heads or tails of where we can stretch our Ceph cluster
and for which applications. Parallelism works excellently, but baseline
throughput is - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD daemons have 16 GB of cache.

Fio files attached - 4 KB random read and 4 KB random write - the test file
is "only" 1 GB.
In this I ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated .. one backed by SSDs (14x 1 TB S4510s)
and one by HDDs (84x 10 TB).

Network latency from the rbd mount to one of the OSD hosts:
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  38       1727.07       2033.66       1954.71     1949.4789     46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  36        400.05        455.26        436.58     433.91417     12.468187
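
As an aside, the same iops numbers can be pulled a bit more directly from
fio's JSON output - a sketch, assuming the jobs were run with
--output-format=json and jq is installed:

# jq '.jobs[].read.iops' read*json | ministat -n
# jq '.jobs[].write.iops' write*json | ministat -n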

The double (or triple) network penalty of course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With a 1 GB test file .. I would really expect this to be memory-cached in
the OSD/bluestore cache and thus deliver read IOPS closer to the theoretical
max: 1 s / 0.108 ms => ~9.2K IOPS.

Again on the write side - all OSDs are backed by battery-backed write cache,
so writes should go directly into the memory of the controller .. still
slower than reads - due to having to visit 3 hosts .. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  38         36.91         118.8         69.14     72.926842      21.75198

This should have the same performance characteristics as the SSDs, as the
writes should be hitting the BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N           Min           Max        Median           Avg        Stddev
x  39         26.18        181.51         48.16     50.574872      24.01572

Same here - should be cached in the bluestore cache, as it is 16 GB x 84
OSDs .. with a 1 GB test file.

Any thoughts - suggestions - insights ?

Jesper

fio-single-thread-randr.ini
Description: Binary data


fio-single-thread-randw.ini
Description: Binary data
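
For readers without the attachments, a sketch of what such a 4 KB qd=1
random-read job file might look like (this variant uses fio's rbd engine;
the pool and image names are placeholders, and against a mapped /dev/rbdX
device one would use ioengine=libaio with a filename= instead - the write
job is the same apart from rw=randwrite):

[global]
ioengine=rbd
clientname=admin
pool=rbd_ssd
rbdname=fio-test
bs=4k
iodepth=1
numjobs=1
direct=1
size=1g
runtime=300
time_based

[fio-single-thread-randr]
rw=randread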