hi oliver,
the IPoIB network will not give you 56 Gbit/s; it's probably a lot less (20 Gbit/s or so).
the ib_write_bw test is verbs/RDMA based, so it does not exercise the IP path. do you have iperf tests
between hosts, and if so, can you share those results?
stijn
> we are just getting started with our first Ceph cluster (Luminous 12.2.2) and
> doing some basic benchmarking.
>
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB)
> on 2 hosts (i.e. 2 SSDs each), setup as:
> - replicated, min size 2, max size 4
> - 128 PGs
> - cephfs_data, living on 6 hosts each of which has the following setup:
> - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller
> to which they are attached is in JBOD personality
> - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as
> block-db by the bluestore OSDs living on the HDDs.
> - Created with:
> ceph osd erasure-code-profile set cephfs_data k=4 m=2
> crush-device-class=hdd crush-failure-domain=host
> ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
> - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB
> block-db
>
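To make the summary above easier to check, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in the mail; the capacity number is nominal (it ignores bluestore overhead and full ratios), and the variable names are made up for illustration.

```python
# Figures taken from the pool description in the mail.
k, m = 4, 2                      # erasure-code profile: data and coding chunks
pg_count = 2048                  # PGs in cephfs_data
hosts, osds_per_host = 6, 32     # data hosts and HDD OSDs per host
osd_size_tb = 4                  # HDD size per OSD, in TB
block_db_gb = 7                  # block-db partition per OSD, in GB

osd_count = hosts * osds_per_host
# Every PG of an EC pool places k+m shards, one per OSD.
shards_per_osd = pg_count * (k + m) / osd_count
raw_tb = osd_count * osd_size_tb
usable_tb = raw_tb * k / (k + m)             # nominal, ignores all overhead
db_ratio_percent = 100 * block_db_gb / (osd_size_tb * 1000)

print(f"OSDs: {osd_count}, PG shards per OSD: {shards_per_osd:.0f}")
print(f"raw: {raw_tb} TB, nominal usable: {usable_tb:.0f} TB")
print(f"block-db is {db_ratio_percent:.3f} % of the data device")
```

So each OSD carries about 64 PG shards on average, and the block-db partition is a bit under 0.2 % of the HDD it serves.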
> The interconnect (public and cluster network)
> is made via IP over Infiniband (56 GBit bandwidth), using the software stack
> that comes with CentOS 7.
>
> This means one of the metadata hosts can fail, and after that one more disk
> can still fail.
> For the data hosts, up to two machines in total can fail.
>
> We have 40 clients connected to this cluster. We now run something like:
> dd if=/dev/zero of=some_file bs=1M count=10000
> on each CPU core of each of the clients, yielding a total of 1120 writing
> processes (all 40 clients have 28+28HT cores),
> using the ceph-fuse client.
>
> This yields a write throughput of a bit below 1 GB/s (capital B), which is
> unexpectedly low.
> Running a BeeGFS on the same cluster before (disks were in RAID 6 in that
> case) yielded throughputs of about 12 GB/s,
> but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph
> :-).
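For a sense of scale, the dd workload described above can be put into numbers. This is only a sketch: it treats "a bit below 1 GB/s" as exactly 1 GB/s (so the real figures are slightly worse), and the variable names are illustrative.

```python
clients = 40
cores_per_client = 28                 # one dd per physical core, as in the mail
count = 10000                         # dd count=10000
bs_bytes = 1024 * 1024                # dd bs=1M means 1 MiB blocks

writers = clients * cores_per_client
total_bytes = writers * count * bs_bytes

observed_bps = 1e9                    # assumption: "a bit below 1 GB/s" rounded up
beegfs_bps = 12e9                     # "about 12 GB/s" reported for BeeGFS

per_writer_mb_s = observed_bps / writers / 1e6
hours_at_observed = total_bytes / observed_bps / 3600
minutes_at_beegfs = total_bytes / beegfs_bps / 60

print(f"{writers} writers, {total_bytes / 1e12:.2f} TB in total")
print(f"per-writer rate: {per_writer_mb_s:.2f} MB/s")
print(f"full run: {hours_at_observed:.2f} h at 1 GB/s, "
      f"{minutes_at_beegfs:.1f} min at 12 GB/s")
```

Each dd process ends up with less than 1 MB/s, so a test with far fewer writers may be a useful comparison point.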
>
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
> Bandwidth (MB/sec): 695.952
> Stddev Bandwidth: 295.223
> Max bandwidth (MB/sec): 1088
> Min bandwidth (MB/sec): 76
> Average IOPS: 173
> Stddev IOPS: 73
> Max IOPS: 272
> Min IOPS: 19
> Average Latency(s): 0.220967
> Stddev Latency(s): 0.305967
> Max latency(s): 2.88931
> Min latency(s): 0.0741061
>
> => This agrees mostly with our basic dd benchmark.
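The rados bench figures above can be cross-checked against each other. The sketch below assumes the default rados bench object size of 4 MB (the mail does not state it) and applies Little's law (throughput = ops in flight / latency) to the `-t 40` concurrency.

```python
object_mb = 4              # assumption: rados bench default object size
concurrency = 40           # -t 40
avg_iops = 173             # "Average IOPS"
avg_latency_s = 0.220967   # "Average Latency(s)"
reported_mb_s = 695.952    # "Bandwidth (MB/sec)"

# Bandwidth implied by the IOPS figure.
bw_from_iops = avg_iops * object_mb
# Little's law: ops/s = ops in flight / average latency.
iops_from_latency = concurrency / avg_latency_s

print(f"bandwidth from IOPS:    {bw_from_iops} MB/s (reported {reported_mb_s})")
print(f"IOPS from latency:      {iops_from_latency:.0f} (reported {avg_iops})")
```

Both numbers land close to the reported ones, which suggests the run was limited by per-operation latency at 40 operations in flight; repeating it with a larger `-t` would show whether throughput still scales.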
>
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec): 1108.75
>
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> {
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "bytes_per_sec": 331850403
> }
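To relate the single-OSD result above to the cluster-wide numbers, here is a deliberately naive ceiling calculation. It assumes all 192 OSDs could sustain the osd.0 rate at the same time and that the only cost of EC is the (k+m)/k write overhead; neither is likely to hold in practice, so treat the result as an upper bound only.

```python
bytes_written = 1073741824     # from "ceph tell osd.0 bench"
bytes_per_sec = 331850403
osd_count = 192
k, m = 4, 2

bench_seconds = bytes_written / bytes_per_sec
osd_mb_s = bytes_per_sec / 1e6

# Naive ceiling: every OSD writes at the osd.0 rate simultaneously.
raw_aggregate_gb_s = osd_count * bytes_per_sec / 1e9
# Each client byte becomes (k+m)/k bytes on disk with erasure coding.
client_ceiling_gb_s = raw_aggregate_gb_s * k / (k + m)

print(f"osd.0: {osd_mb_s:.1f} MB/s over {bench_seconds:.2f} s")
print(f"naive raw aggregate:   {raw_aggregate_gb_s:.1f} GB/s")
print(f"naive client ceiling:  {client_ceiling_gb_s:.1f} GB/s")
```

Even this crude bound is far above the roughly 1 GB/s observed, which is consistent with the disks themselves not being the first suspect.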
>
> I checked and the OSD-hosts peaked at a load average of about 22 (they have
> 24+24HT cores) in our dd benchmark,
> but stayed well below that (only about 20 % per OSD daemon) in the rados
> bench test.
> One idea would be to switch from jerasure to ISA, since the machines all
> have Intel CPUs anyway.
>
> Already tried:
> - TCP stack tuning (wmem, rmem), no huge effect.
> - changing the block sizes used by dd, no effect.
> - Testing network throughput with ib_write_bw, this revealed something like:
> #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
> 2       5000         19.73            19.30               10.118121
> 4       5000         52.79            51.70               13.553412
> 8       5000         101.23           96.65               12.668371
> 16      5000         243.66           233.42              15.297583
> 32      5000         350.66           344.73              11.296089
> 64      5000         909.14           324.85              5.322323
> 128     5000         1424.84          1401.29             11.479374
> 256     5000         2865.24          2801.04             11.473055
> 512     5000         5169.98          5095.08             10.434733
> 1024    5000         10022.75         9791.42             10.026410
> 2048    5000         10988.64         10628.83            5.441958
> 4096    5000         11401.40         11399.14            2.918180
> [...]
>
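For comparison with the link rate, the ib_write_bw peak for 4096-byte messages can be converted to Gbit/s. I am not certain whether the tool's "MB" means 10^6 or 2^20 bytes, so the sketch below computes both; the variable names are illustrative.

```python
peak_mb_s = 11401.40        # BW peak[MB/sec] for 4096-byte messages
quoted_link_gbit = 56       # link rate quoted in the mail

# Interpretation 1: MB = 10^6 bytes.
gbit_decimal = peak_mb_s * 1e6 * 8 / 1e9
# Interpretation 2: MB = 2^20 bytes (MiB).
gbit_binary = peak_mb_s * 2**20 * 8 / 1e9

print(f"peak, decimal MB: {gbit_decimal:.1f} Gbit/s")
print(f"peak, binary MB:  {gbit_binary:.1f} Gbit/s")
print(f"quoted link rate: {quoted_link_gbit} Gbit/s")
```

Under either reading the verbs-level figure comes out above the quoted 56 GBit, so the link rate or the unit may be worth double-checking. In any case this measures the RDMA path, not what TCP over IPoIB achieves.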
> So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using
> RDMA).
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I
> read the list correctly.
> - Increasing osd_pool_erasure_code_stripe_width.
> - Using ISA as EC plugin.
> - Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark
> is ongoing, swap is used (but not when performing benchmarking only,
> so this should not explain the slowdown).
>
> However, since we are just beginning with Ceph, it may well be we are missing
> something basic, but crucial here.
> For example, could it be that the block-db storage is too small? How can we
> find out?
>
> Do any ideas come to mind?
>
> A second, hopefully easier question:
> If one OSD-host fails in our setup, all PGs are changed to
> "active+clean+remapped" and lots of data is moved.
> I understand the remapping is needed, but why is data actually moved? With
> k=4 and m=2, failure domain=host,
> and 6 hosts of which one is down, moving data around after one host has gone
> down should bring no advantage for redundancy. Or am I missing something here?
>
> Cheers and many thanks in advance,
> Oliver
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com