"I checked and the OSD-hosts peaked at a load average of about 22 (they
have 24+24HT cores) in our dd benchmark,
but stayed well below that (only about 20 % per OSD daemon) in the rados
bench test."
Maybe because your dd test uses bs=1M and rados bench is using 4M as
default block size?
2018-02-18 16:03 GMT+01:00 Oliver Freyermuth <freyerm...@physik.uni-bonn.de>:
> Dear Cephalopodians,
> we are just getting started with our first Ceph cluster (Luminous 12.2.2)
> and doing some basic benchmarking.
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240
> GB) on 2 hosts (i.e. 2 SSDs each), setup as:
> - replicated, min size 2, max size 4
> - 128 PGs
> - cephfs_data, living on 6 hosts each of which has the following setup:
> - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI
> controller to which they are attached is in JBOD personality
> - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as
> block-db by the bluestore OSDs living on the HDDs.
> - Created with:
> ceph osd erasure-code-profile set cephfs_data k=4 m=2
> crush-device-class=hdd crush-failure-domain=host
> ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
> - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB
> block-db.
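The summary figures can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes every erasure-coded PG places k+m = 6 shards on distinct OSDs and it ignores bluestore overhead. The variable names are made up for this example.

```python
# Cross-check of the cluster summary; all inputs are taken from the post.
hosts = 6
hdds_per_host = 32
hdd_tb = 4                     # TB per HDD OSD
db_partitions_per_host = 2 * 16  # 2 SSDs with 16 block-db partitions each
k, m = 4, 2                    # erasure-code profile cephfs_data
pg_num = 2048

osds = hosts * hdds_per_host                   # number of HDD OSDs
pg_shards_per_osd = pg_num * (k + m) / osds    # assumption: k+m shards per PG
raw_tb = osds * hdd_tb
usable_tb = raw_tb * k / (k + m)               # EC overhead only

print("OSDs:", osds)
print("PG shards per OSD:", pg_shards_per_osd)
print("raw TB:", raw_tb, "usable TB:", usable_tb)
print("one block-db partition per HDD:", db_partitions_per_host == hdds_per_host)
```

Under these assumptions each OSD carries 64 PG shards, and the 768 TB of raw HDD capacity corresponds to roughly 512 TB of usable space.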
> The interconnect (public and cluster network)
> is made via IP over Infiniband (56 GBit bandwidth), using the software
> stack that comes with CentOS 7.
> This leaves us with the possibility that one of the metadata-hosts can
> fail, and still one of the disks can fail.
> For the data hosts, up to two machines total can fail.
> We have 40 clients connected to this cluster. We now run something like:
> dd if=/dev/zero of=some_file bs=1M count=10000
> on each CPU core of each of the clients, yielding a total of 1120 writing
> processes (all 40 clients have 28+28HT cores),
> using the ceph-fuse client.
> This yields a write throughput of a bit below 1 GB/s (capital B), which is
> unexpectedly low.
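To put the dd result into perspective, the aggregate figure can be broken down per writing process. This is a rough sketch only: "a bit below 1 GB/s" is treated as exactly 1000 MB/s, which is an assumption and an upper bound.

```python
# Per-process view of the dd benchmark described above.
clients = 40
cores_per_client = 28          # physical cores; one dd per core
mib_per_writer = 10000         # bs=1M count=10000

writers = clients * cores_per_client
total_gib = writers * mib_per_writer / 1024

aggregate_mb_s = 1000          # assumption: "a bit below 1 GB/s" rounded up
per_writer_mb_s = aggregate_mb_s / writers

print("writers:", writers)
print("total data written (GiB):", total_gib)
print("throughput per dd process (MB/s): %.2f" % per_writer_mb_s)
```

With 1120 concurrent writers, each dd process gets less than 1 MB/s on average.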
> Running a BeeGFS on the same cluster before (disks were in RAID 6 in that
> case) yielded throughputs of about 12 GB/s,
> but came with other issues (e.g. it's not FOSS...), so we'd love to run
> Ceph :-).
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
> Bandwidth (MB/sec): 695.952
> Stddev Bandwidth: 295.223
> Max bandwidth (MB/sec): 1088
> Min bandwidth (MB/sec): 76
> Average IOPS: 173
> Stddev IOPS: 73
> Max IOPS: 272
> Min IOPS: 19
> Average Latency(s): 0.220967
> Stddev Latency(s): 0.305967
> Max latency(s): 2.88931
> Min latency(s): 0.0741061
> => This agrees mostly with our basic dd benchmark.
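The rados bench numbers are internally consistent, which can be checked as follows. The sketch assumes the default 4 MB object size mentioned in the reply above, and uses Little's law (concurrency = throughput x latency) as a rough plausibility check rather than an exact model.

```python
# Consistency check of the rados bench write output quoted above.
object_mb = 4                  # assumption: rados bench default object size
threads = 40                   # from "-t 40"

avg_bw, max_bw, min_bw = 695.952, 1088, 76   # MB/sec
avg_latency_s = 0.220967

iops_avg = avg_bw / object_mb
iops_max = max_bw / object_mb
iops_min = min_bw / object_mb

# Little's law: with 40 operations in flight and the measured average
# latency, the expected completion rate is threads / latency.
iops_littles_law = threads / avg_latency_s

print("IOPS from bandwidth: avg %.1f, max %.0f, min %.0f"
      % (iops_avg, iops_max, iops_min))
print("IOPS from Little's law: %.1f" % iops_littles_law)
```

The bandwidth-derived values match the reported 173 / 272 / 19 IOPS. The Little's law estimate lands a few percent higher, which suggests the 40 threads were busy for most of the run and the throughput is latency-bound at this concurrency.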
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec): 1108.75
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "bytes_per_sec": 331850403
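For comparison, the single-OSD result can be scaled up to a naive ceiling for the whole pool. This is a deliberately optimistic sketch: it assumes all 192 OSDs reach the osd.0 figure simultaneously and accounts only for the (k+m)/k write amplification of the erasure code, ignoring controller, network, CPU and block-db limits.

```python
# Naive upper bound derived from the "ceph tell osd.0 bench" output above.
bytes_written = 1073741824
blocksize = 4194304
bytes_per_sec = 331850403

osds = 192
k, m = 4, 2

blocks = bytes_written // blocksize
mib_per_sec = bytes_per_sec / 2**20
duration_s = bytes_written / bytes_per_sec

raw_ceiling_gb_s = osds * bytes_per_sec / 1e9
# Each client byte becomes (k+m)/k bytes on disk with erasure coding.
client_ceiling_gb_s = raw_ceiling_gb_s * k / (k + m)

print("blocks written:", blocks)
print("single OSD: %.1f MiB/s over %.2f s" % (mib_per_sec, duration_s))
print("naive client-side ceiling: %.1f GB/s" % client_ceiling_gb_s)
```

Even with generous allowance for everything this ignores, the measured ~0.7 to 1 GB/s is far below what the disks alone would suggest.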
> I checked and the OSD-hosts peaked at a load average of about 22 (they
> have 24+24HT cores) in our dd benchmark,
> but stayed well below that (only about 20 % per OSD daemon) in the rados
> bench test.
> One idea would be to switch from jerasure to ISA, since the machines
> all have Intel CPUs anyway.
> Already tried:
> - TCP stack tuning (wmem, rmem), no huge effect.
> - changing the block sizes used by dd, no effect.
> - Testing network throughput with ib_write_bw, this revealed something
> #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
> 2       5000         19.73            19.30
> 4       5000         52.79            51.70
> 8       5000         101.23           96.65
> 16      5000         243.66           233.42
> 32      5000         350.66           344.73
> 64      5000         909.14           324.85              5.322323
> 128     5000         1424.84          1401.29
> 256     5000         2865.24          2801.04
> 512     5000         5169.98          5095.08
> 1024    5000         10022.75         9791.42
> 2048    5000         10988.64         10628.83
> 4096    5000         11401.40         11399.14
> So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if
> I read the list correctly.
> - Increasing osd_pool_erasure_code_stripe_width.
> - Using ISA as EC plugin.
> - Reducing the bluestore_cache_size_hdd, it seems when recovery +
> benchmark is ongoing, swap is used (but not when performing benchmarking
> so this should not explain the slowdown).
> However, since we are just beginning with Ceph, it may well be we are
> missing something basic, but crucial here.
> For example, could it be that the block-db storage is too small? How to
> find out?
> Do any ideas come to mind?
> A second, hopefully easier question:
> If one OSD-host fails in our setup, all PGs are changed to
> "active+clean+remapped" and lots of data is moved.
> I understand the remapping is needed, but why is data actually moved? With
> k=4 and m=2, failure domain=host,
> and 6 hosts of which one is down, there should be no advantage for
> redundancy in moving data around after one host has gone down - or am I
> missing something here?
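The reasoning behind this question can be written down as a small model. It is a simplification and not how CRUSH actually computes placements: it assumes exactly one shard per host (failure domain = host), and the helper names are hypothetical.

```python
# Simplified model of EC shard placement with failure domain = host.
def shards_placeable(k, m, hosts_up):
    """Number of the k+m shards that can sit on distinct hosts."""
    return min(k + m, hosts_up)

def fully_redundant(k, m, hosts_up):
    """True if every shard of a PG can be placed on its own host."""
    return hosts_up >= k + m

def readable(k, m, hosts_up):
    """True if at least k shards are available to reconstruct data."""
    return shards_placeable(k, m, hosts_up) >= k

k, m = 4, 2
for hosts_up in (6, 5, 4, 3):
    print(hosts_up, "hosts up:",
          "fully redundant" if fully_redundant(k, m, hosts_up) else "degraded",
          "/ readable" if readable(k, m, hosts_up) else "/ NOT readable")
```

In this model, with 5 of 6 hosts up only 5 of the 6 shards can be placed on distinct hosts, so no rearrangement among the remaining hosts restores full redundancy. Data stays readable until more than two hosts are down.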
> Cheers and many thanks in advance,
> ceph-users mailing list