On 07/04/2014 04:13 AM, Marco Allevato wrote:
Hello Ceph-Community,
I’m writing here because we are seeing poor write performance on our
Ceph cluster of about
_As an overview, here are the technical details of our cluster:_
3 x monitoring servers; each with 2 x 1 Gbit/s NICs configured as a
bond (link aggregation mode)
5 x datastore servers; each with 10 x 4 TB HDDs serving as OSDs; as
the journal we use a 15 GB LVM volume on a 256 GB SSD RAID1; 2 x 10
Gbit/s NICs configured as a bond (link aggregation mode)
What SSDs are you using? If this is just a single pair of typical
consumer grade 256GB SSDs, you are likely oversubscribing them quite a
bit. A RAID1 of two of them won't achieve anywhere near the sequential
write performance the disks can do. Since SSD failure tends to be kind
of clumpy based on the number of writes, I'm not really fond of doing
SSD journal devices in a RAID1. I'd rather reduce write wear and just
use them individually with half the journals on each SSD. Better
performance too.
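A sketch of that layout, assuming 10 OSDs per node with 5 x 15GB raw journal partitions on each of the two SSDs (the device name /dev/sdX and the partition numbering are hypothetical examples, not your actual devices; this needs a running node and will modify the disk):

```shell
# Hypothetical: carve one SSD into five 15GB raw journal partitions for
# osd.0-4; repeat on the second SSD for osd.5-9. The typecode is the
# standard Ceph journal partition GUID used by ceph-disk.
for i in 1 2 3 4 5; do
    sgdisk --new=${i}:0:+15G \
           --typecode=${i}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdX
done
```

Each OSD then points its journal at its own partition (osd journal = /dev/sdX1 for osd.0, and so on), with no RAID layer in between.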
_ceph.conf_
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.30.30.8,172.30.30.9
mon_initial_members = monitoring1, monitoring2, monitoring3
fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
public network = 172.30.30.0/24
[mon.monitoring1]
host = monitoring1
addr = 172.30.30.8:6789
[mon.monitoring2]
host = monitoring2
addr = 172.30.30.9:6789
[mon.monitoring3]
host = monitoring3
addr = 172.30.30.10:6789
[filestore]
filestore max sync interval = 10
This is tough to get right imho. Less frequent synchronization gives
you more time to potentially coalesce write ops, but also may cause
spikier behaviour which we like to avoid. Probably requires just
testing it and seeing what happens in your setup.
[osd]
osd recovery max active = 1
osd journal size = 15360
osd op threads = 40
osd disk threads = 40
Having so many disk threads and op threads probably isn't going to help
you. I'd reduce them unless you've actually tested that this is faster
than the default. In some cases increasing the op threads does improve
performance, but ultimately you probably don't want to go too high. I
often set it to 8, but I suspect the best value is platform dependent,
varying with how many CPU cores and disks you have. Basically you want
the minimum number of threads that lets you achieve saturation.
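For example, something much closer to the defaults as a baseline to test against (a starting point, not a tuned recommendation for your hardware):

```
[osd]
osd op threads = 8
osd disk threads = 1
```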
Some other things to consider:
1) Are you using Kernel RBD or userland RBD? Have you considered using
RBD Cache? This can help dramatically, especially with smaller
sequential writes. In your case you are testing 1MB random writes, so
it may not help as much, but it's still worth trying.
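If you are on the userland (librbd) side, enabling it is just a client-side ceph.conf change; the sizes shown here are simply the defaults spelled out, and kernel RBD ignores these settings entirely:

```
[client]
rbd cache = true
rbd cache size = 33554432        # 32MB (the default)
rbd cache max dirty = 25165824   # 24MB (the default)
rbd cache writethrough until flush = true
```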
2) If you have few PGs in the pool and *lots* of objects (say many large
volumes, and/or small RBD object sizes), you may benefit from increasing
the filestore merge threshold and filestore split multiple parameters.
These govern how the filestore splits directories that objects are
stored in under the OSD per PG. It appears that increasing the
per-directory limits can reduce performance degradation as the number of
objects increases, but may slow down lookup operations. Currently,
however, the default works out to around 300 objects per directory,
which I believe may be too low. Here's what I like to use:
filestore merge threshold = 40
filestore split multiple = 8
3) You might find that disabling in-memory debug logging helps, though
this typically helps more with small random IOs.
debug_lockdep = "0/0"
debug_context = "0/0"
debug_crush = "0/0"
debug_mds = "0/0"
debug_mds_balancer = "0/0"
debug_mds_locker = "0/0"
debug_mds_log = "0/0"
debug_mds_log_expire = "0/0"
debug_mds_migrator = "0/0"
debug_buffer = "0/0"
debug_timer = "0/0"
debug_filer = "0/0"
debug_objecter = "0/0"
debug_rados = "0/0"
debug_rbd = "0/0"
debug_journaler = "0/0"
debug_objectcacher = "0/0"
debug_client = "0/0"
debug_osd = "0/0"
debug_optracker = "0/0"
debug_objclass = "0/0"
debug_filestore = "0/0"
debug_journal = "0/0"
debug_ms = "0/0"
debug_mon = "0/0"
debug_monc = "0/0"
debug_paxos = "0/0"
debug_tp = "0/0"
debug_auth = "0/0"
debug_finisher = "0/0"
debug_heartbeatmap = "0/0"
debug_perfcounter = "0/0"
debug_rgw = "0/0"
debug_hadoop = "0/0"
debug_asok = "0/0"
debug_throttle = "0/0"
[osd.0]
host = datastore1
[osd.1]
host = datastore1
[osd.2]
host = datastore1
[osd.3]
host = datastore1
[osd.4]
host = datastore1
[osd.5]
host = datastore1
[osd.6]
host = datastore1
[osd.7]
host = datastore1
[osd.8]
host = datastore1
[osd.9]
host = datastore1
[osd.10]
host = datastore2
[osd.11]
host = datastore2
[osd.11]
host = datastore2
[osd.12]
host = datastore2
[osd.13]
host = datastore2
[osd.14]
host = datastore2
[osd.15]
host = datastore2
[osd.16]
host = datastore2
[osd.17]
host = datastore2
[osd.18]
host = datastore2
[osd.19]
host = datastore2
[osd.20]
host = datastore3
[osd.21]
host = datastore3
[osd.22]
host = datastore3
[osd.23]
host = datastore3
[osd.24]
host = datastore3
[osd.25]
host = datastore3
[osd.26]
host = datastore3
[osd.27]
host = datastore3
[osd.28]
host = datastore3
[osd.29]
host = datastore3
[osd.30]
host = datastore4
[osd.31]
host = datastore4
[osd.32]
host = datastore4
[osd.33]
host = datastore4
[osd.34]
host = datastore4
[osd.35]
host = datastore4
[osd.36]
host = datastore4
[osd.37]
host = datastore4
[osd.38]
host = datastore4
[osd.39]
host = datastore4
[osd.0]
host = datastore5
[osd.40]
host = datastore5
[osd.41]
host = datastore5
[osd.42]
host = datastore5
[osd.43]
host = datastore5
[osd.44]
host = datastore5
[osd.45]
host = datastore5
[osd.46]
host = datastore5
[osd.47]
host = datastore5
[osd.48]
host = datastore5
You appear to have osd.0 defined twice, on datastore1 and datastore5
(osd.11 is also listed twice on datastore2). Not sure if that was a
transcription mistake or is actually in the ceph configuration?
We have 3 pools:
-> 2 pools with 1000 PGs each and 2 replicas, distributing the data
equally across two racks (used for datastores 1-4)
1000 PGs isn't imho enough for good distribution over 40 OSDs. I'd up
this to around 4096.
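pg_num can be raised on an existing pool (it can't be decreased); using the pool1 name from your fio command as an example, and noting that this requires a live cluster and will trigger data movement:

```shell
ceph osd pool set pool1 pg_num 4096
# once the new PGs have been created, raise pgp_num so the
# data actually rebalances across them
ceph osd pool set pool1 pgp_num 4096
```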
-> 1 pool with 100 PGs and no replication; data stored only on
datastore5. This pool is used to compare performance on local disks
without networking involved
100 PGs *really* isn't enough. For 10 OSDs I'd do at least 1024.
Remember the filestore split multiple parameter from above? The fewer
PGs you have, the more important it becomes, because you are shoving
more objects into each PG and triggering directory splits much sooner.
Consider that a 32GB volume is 32768MB, and with 4MB RBD objects that's
8192 objects. Split that over the 100 PGs in your pool and you end up
with around 80 objects per PG on average. The distribution isn't
perfectly even though, so some PGs are going to have more objects than
others. Now in this case you probably haven't hit the point where
directories start splitting, but had you used replication, bigger
volumes, or multiple volumes on multiple clients, you very easily could
have started invoking directory splits on the OSDs.
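The arithmetic above, sketched out with the numbers from this thread (32GB volume, 4MB objects, the 100-PG pool):

```python
volume_mb = 32 * 1024     # 32GB volume, in MB
object_mb = 4             # default RBD object size
pg_count = 100            # the no-replication pool

objects = volume_mb // object_mb   # total RBD objects backing the volume
per_pg = objects / pg_count        # average objects per PG
print(objects, per_pg)             # 8192 objects, ~82 per PG on average
```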
At some point you have to be careful increasing the number of PGs
because it puts more load on the monitors, but typically this won't be a
problem until you get up closer to 100,000 PGs or more. I've
successfully tested a cluster with 1 million PGs.
Here are the performance values I get using fio on a 32GB RBD:
_On the 1000-PG pool with distribution:_
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32
--runtime=60 --name=/dev/rbd/pool1/bench1
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops]
[eta 00m:00s]
/dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul
4 11:03:52 2014
write: io=21071MB, bw=358989KB/s, iops=350 , runt= 60104msec
slat (usec): min=127 , max=8040 , avg=511.49, stdev=216.27
clat (msec): min=5 , max=4018 , avg=90.74, stdev=215.83
lat (msec): min=6 , max=4018 , avg=91.25, stdev=215.83
clat percentiles (msec):
| 1.00th=[ 8], 5.00th=[ 9], 10.00th=[ 11], 20.00th=[ 15],
| 30.00th=[ 21], 40.00th=[ 30], 50.00th=[ 45], 60.00th=[ 63],
| 70.00th=[ 83], 80.00th=[ 105], 90.00th=[ 129], 95.00th=[ 190],
| 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
| 99.99th=[ 3556]
bw (KB/s) : min=68210, max=479232, per=100.00%, avg=368399.55,
stdev=84457.12
lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
cpu : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=21071/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=21071MB, aggrb=358989KB/s, minb=358989KB/s,
maxb=358989KB/s, mint=60104msec, maxt=60104msec
_On the 100-PG pool without distribution:_
WRITE: io=5884.0MB, aggrb=297953KB/s, minb=297953KB/s,
maxb=297953KB/s, mint=20222msec, maxt=20222msec
Do you have any suggestions on how to improve the performance?
Just a general note, do you expect 1MB random writes to be a typical
workload for your cluster? Having said that, what I'm noticing here is
that in the single-node setup with no replication, you get around
300MB/s. If you have a single SSD RAID1 for your journals, 300MB/s
sounds about right for many ~250GB models.
For the 40 OSD pool you are doing 2x replication, so you are getting an
aggregate of about 180MB/s per node, which is quite a bit lower. On the
other hand, the amount of concurrency you have isn't really very high
for that many disks and you will lose some aggregate performance when
replication is enabled.
The first thing I would do is just try your test with more concurrency.
If you don't have more clients, just increase the number of fio jobs.
I'd also increase the number of PGs and possibly the split multiplier
when you do this! Next I'd try enabling RBD cache and making a couple
of the ceph.conf tunable tweaks mentioned above. Finally I would
seriously consider switching the SSD journal configuration so that half
of the journals are on each SSD, placed directly in raw partitions.
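Concretely, the same fio invocation from your mail with more parallelism might look like this (the numjobs value is just an example to start from; this needs your actual RBD device to run):

```shell
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --numjobs=4 --group_reporting \
    --name=/dev/rbd/pool1/bench1
```

--group_reporting aggregates the per-job statistics so the results stay comparable with your single-job runs.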
From reading on the internet, typical write rates should be around
800-1000 MB/s when using a 10 Gbit/s connection with a similar setup.
You won't get that unless you have very fast SSDs or I'm
misunderstanding your SSD journal setup.
Thanks in advance
Hope this helps!
--
Marco Allevato
Projektteam
Network Engineering GmbH
Maximilianstrasse 93
D-67346 Speyer
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com