"I checked and the OSD-hosts peaked at a load average of about 22 (they
have 24+24HT cores) in our dd benchmark,
but stayed well below that (only about 20 % per OSD daemon) in the rados
bench test."

Maybe that is because your dd test uses bs=1M, while rados bench uses 4M
as its default block size?
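To compare like with like, you could either give rados bench the same
block size (in bytes) or let dd write 4M blocks, e.g. something along
these lines:

  rados bench -p cephfs_data 10 write --no-cleanup -t 40 -b 1048576
  dd if=/dev/zero of=some_file bs=4M count=2500
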
Caspar

2018-02-18 16:03 GMT+01:00 Oliver Freyermuth <freyerm...@physik.uni-bonn.de>:

> Dear Cephalopodians,
>
> we are just getting started with our first Ceph cluster (Luminous 12.2.2)
> and doing some basic benchmarking.
>
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each a bluestore OSD, 240 GB)
>   on 2 hosts (i.e. 2 SSDs each), set up as:
>   - replicated, size 4, min_size 2
>   - 128 PGs
> - cephfs_data, living on 6 hosts, each of which has the following setup:
>   - 32 HDD drives (4 TB), each a bluestore OSD; the LSI controller they
>     are attached to runs in JBOD personality
>   - 2 SSD drives, each with 16 partitions of 7 GB, used as block-db by
>     the bluestore OSDs living on the HDDs.
>   - Created with:
>     ceph osd erasure-code-profile set cephfs_data k=4 m=2 \
>         crush-device-class=hdd crush-failure-domain=host
>     ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
>   - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB
> block-db
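>
> For completeness, the resulting profile and pool parameters can be
> double-checked with:
>     ceph osd erasure-code-profile get cephfs_data
>     ceph osd pool get cephfs_data all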
>
> The interconnect (public and cluster network) is IP over InfiniBand
> (56 Gbit/s), using the software stack that comes with CentOS 7.
>
> This leaves us with the possibility that one of the metadata hosts can
> fail and, on top of that, one further disk can fail.
> For the data hosts, up to two machines in total can fail.
>
> We have 40 clients connected to this cluster. We now run something like:
> dd if=/dev/zero of=some_file bs=1M count=10000
> once per physical CPU core on each client, yielding a total of 1120
> writer processes (all 40 clients have 28+28HT cores), using the
> ceph-fuse client.
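>
> Roughly, each client runs something like the following (paths and file
> names here are just placeholders):
>     # one dd writer per physical core, writing into CephFS via ceph-fuse
>     for i in $(seq 1 28); do
>         dd if=/dev/zero of=/cephfs/bench/${HOSTNAME}_$i bs=1M count=10000 &
>     done
>     wait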
>
> This yields a write throughput of a bit below 1 GB/s (capital B), which
> is unexpectedly low.
> Running BeeGFS on the same cluster before (with the disks in RAID 6 in
> that case) yielded throughputs of about 12 GB/s, but it came with other
> issues (e.g. it's not FOSS...), so we'd love to run Ceph :-).
>
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
> Bandwidth (MB/sec):     695.952
> Stddev Bandwidth:       295.223
> Max bandwidth (MB/sec): 1088
> Min bandwidth (MB/sec): 76
> Average IOPS:           173
> Stddev IOPS:            73
> Max IOPS:               272
> Min IOPS:               19
> Average Latency(s):     0.220967
> Stddev Latency(s):      0.305967
> Max latency(s):         2.88931
> Min latency(s):         0.0741061
>
> => This mostly agrees with our basic dd benchmark.
>
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec):   1108.75
>
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> {
>     "bytes_written": 1073741824,
>     "blocksize": 4194304,
>     "bytes_per_sec": 331850403
> }
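>
> A rough back-of-envelope check (assuming the single-OSD figure scaled
> linearly and ignoring WAL/db and metadata overhead): 192 OSDs * ~330 MB/s
> is roughly 63 GB/s raw, or about 42 GB/s client-visible after the 1.5x
> write amplification of k=4/m=2 - so the raw disks alone should be far
> from the limit.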
>
> I checked and the OSD-hosts peaked at a load average of about 22 (they
> have 24+24HT cores) in our dd benchmark,
> but stayed well below that (only about 20 % per OSD daemon) in the rados
> bench test.
> One idea would be to switch from jerasure to ISA, since all the machines
> have Intel CPUs anyway.
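>
> As far as I understand, the EC profile of an existing pool cannot simply
> be swapped, so testing ISA would presumably mean a new profile and a
> fresh test pool, roughly (names and pg_num are only placeholders):
>     ceph osd erasure-code-profile set cephfs_data_isa plugin=isa k=4 m=2 \
>         crush-device-class=hdd crush-failure-domain=host
>     ceph osd pool create cephfs_data_isa_test 1024 1024 erasure cephfs_data_isa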
>
> Already tried:
> - TCP stack tuning (wmem, rmem), no huge effect.
> - changing the block sizes used by dd, no effect.
> - Testing network throughput with ib_write_bw, which revealed something
> like:
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
>  2          5000             19.73              19.30               10.118121
>  4          5000             52.79              51.70               13.553412
>  8          5000             101.23             96.65               12.668371
>  16         5000             243.66             233.42              15.297583
>  32         5000             350.66             344.73              11.296089
>  64         5000             909.14             324.85               5.322323
>  128        5000             1424.84            1401.29             11.479374
>  256        5000             2865.24            2801.04             11.473055
>  512        5000             5169.98            5095.08             10.434733
>  1024       5000             10022.75           9791.42             10.026410
>  2048       5000             10988.64           10628.83             5.441958
>  4096       5000             11401.40           11399.14             2.918180
> [...]
>
> So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using
> RDMA).
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if
> I read the list correctly.
> - Increasing osd_pool_erasure_code_stripe_width.
> - Using ISA as EC plugin.
> - Reducing the bluestore_cache_size_hdd: it seems swap is used when
>   recovery and the benchmark run at the same time (but not when
>   benchmarking alone, so this should not explain the slowdown); a
>   possible ceph.conf snippet is sketched below.
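>
>   For the cache size, I assume this would go into ceph.conf on the OSD
>   hosts, followed by an OSD restart (value below only an example):
>     [osd]
>     # e.g. 512 MiB instead of the default (1 GiB for HDD OSDs, I believe)
>     bluestore_cache_size_hdd = 536870912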
>
> However, since we are just beginning with Ceph, it may well be that we
> are missing something basic but crucial here.
> For example, could it be that the block-db storage is too small? How can
> we find out?
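>
> One thing that might help (if I understand the bluefs perf counters
> correctly) is comparing db_used_bytes with db_total_bytes per OSD, run
> on the respective OSD host; slow_used_bytes > 0 would presumably mean
> the block-db already spills over onto the HDD:
>     ceph daemon osd.0 perf dump | grep -E '(db|slow)_(total|used)_bytes'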
>
> Do any ideas come to mind?
>
> A second, hopefully easier question:
> If one OSD-host fails in our setup, all PGs are changed to
> "active+clean+remapped" and lots of data is moved.
> I understand the remapping is needed, but why is data actually moved?
> With k=4 and m=2, failure domain=host, and 6 hosts of which one is down,
> moving data around should bring no gain in redundancy after a host has
> gone down - or am I missing something here?
>
> Cheers and many thanks in advance,
>         Oliver
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
