Hi Oliver,

I also use InfiniBand and CephFS for HPC purposes.

My setup:

 * 4x Dell R730xd with expansion shelf, 24 OSDs of 8 TB each, 128 GB RAM,
   2x 10-core Intel (4th gen), Mellanox ConnectX-3, no SSD cache

 * 7x Dell R630 Clients

 * Ceph cluster running on Ubuntu Xenial with Ceph Jewel, deployed with
   ceph-ansible
 * CephFS clients on Debian Stretch, using the CephFS kernel module

 * IPoIB for public and cluster network; IB adapters are in connected
   mode and the MTU is 65520


Future improvements: moving the cephfs_metadata pool to an NVMe pool, and updating to Luminous and BlueStore.

root@polstor02:/home/urzadmin# ceph -s
    cluster 7c4bfd06-046f-49e4-bb77-0402d7ca98e5
     health HEALTH_OK
     monmap e2: 3 mons at {polstor01=10.10.144.211:6789/0,polstor02=10.10.144.212:6789/0,polstor03=10.10.144.213:6789/0}
            election epoch 5034, quorum 0,1,2 polstor01,polstor02,polstor03
      fsmap e2091562: 1/1/1 up {0=polstor02=up:active}, 1 up:standby-replay, 1 up:standby
     osdmap e2078945: 95 osds: 95 up, 95 in
            flags sortbitwise,require_jewel_osds
      pgmap v8638409: 4224 pgs, 2 pools, 93414 GB data, 34592 kobjects
            274 TB used, 416 TB / 690 TB avail
                4221 active+clean
                   3 active+clean+scrubbing+deep
  client io 1658 B/s rd, 3 op/s rd, 0 op/s wr
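
As a rough cross-check of the status output above, the ratio of raw space used to data stored hints at the replication factor, and from that the PG load per OSD follows. This is only a sketch: it assumes the "GB" and "TB" figures are binary units and that both pools are replicated with the same size, neither of which is stated in the output.

```python
# Figures copied from the `ceph -s` output above.
data_gb = 93414   # "93414 GB data"
used_tb = 274     # "274 TB used"
num_pgs = 4224
num_osds = 95

# Assumption: the units are binary, so 1 TB = 1024 GB.
data_tb = data_gb / 1024

# Raw space consumed per unit of stored data. For replicated pools this
# should sit close to the replica count.
overhead = used_tb / data_tb

# Assumption: both pools use the same replica count.
replicas = round(overhead)

# Average number of PG replicas hosted by each OSD.
pgs_per_osd = num_pgs * replicas / num_osds

print(f"raw/stored ratio: {overhead:.2f}")
print(f"estimated replicas: {replicas}")
print(f"PG replicas per OSD: {pgs_per_osd:.0f}")
```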


These are my measurements:

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42584
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  27.2 GBytes  23.3 Gbits/sec
[  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42586
[  5]  0.0-10.0 sec  25.4 GBytes  21.8 Gbits/sec
[  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42588
[  4]  0.0-10.0 sec  19.9 GBytes  17.1 Gbits/sec
[  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42590
[  5]  0.0-10.0 sec  20.2 GBytes  17.3 Gbits/sec
[  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42592
[  4]  0.0-10.0 sec  30.2 GBytes  25.9 Gbits/sec
[  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42594
[  5]  0.0-10.0 sec  26.1 GBytes  22.4 Gbits/sec
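
To put the six iperf runs above on one scale with the storage benchmarks, the bandwidth column can be averaged and converted to bytes per second. A small sketch; the only input is the Bandwidth column as printed.

```python
from statistics import mean

# Bandwidth column of the six iperf runs above, in Gbit/s.
runs_gbit = [23.3, 21.8, 17.1, 17.3, 25.9, 22.4]

avg_gbit = mean(runs_gbit)

# Convert to decimal GB/s for comparison with storage throughput numbers.
avg_gbyte = avg_gbit / 8

print(f"mean {avg_gbit:.1f} Gbit/s, about {avg_gbyte:.2f} GB/s")
print(f"range {min(runs_gbit)} to {max(runs_gbit)} Gbit/s")
```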

root@polstor02:/home/urzadmin# rados bench -p cephfs_data 10 write --no-cleanup -t 40
Maintaining 40 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_polstor02_3189601
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) avg lat(s)
    0       0         0         0         0         0 -           0
    1      39       262       223   891.992       892   0.0952355 0.156985
    2      39       497       458   915.934       940    0.129115 0.162122
    3      39       675       636   847.921       712    0.557279 0.172988
    4      39       857       818   817.921       728    0.154144 0.186755
    5      39      1042      1003   802.315       740    0.135748 0.191932
    6      39      1223      1184   789.248       724     0.13996 0.197136
    7      39      1411      1372   783.912       752    0.204627 0.196429
    8      39      1556      1517   758.414       580    0.253825 0.201344
    9      39      1722      1683   747.916       664    0.175682 0.209318
   10      39      1866      1827   730.715       576     0.37722 0.212927
Total time run:         10.503421
Total writes made:      1867
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     711.006
Stddev Bandwidth:       116.36
Max bandwidth (MB/sec): 940
Min bandwidth (MB/sec): 576
Average IOPS:           177
Stddev IOPS:            29
Max IOPS:               235
Min IOPS:               144
Average Latency(s):     0.222746
Stddev Latency(s):      0.160678
Max latency(s):         2.68037
Min latency(s):         0.0621196
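
The write summary above can be sanity-checked in two ways: the reported bandwidth should equal objects written times object size divided by run time, and Little's law (throughput = concurrency / latency) should land in the same range. A minimal sketch using only the figures printed above; it assumes "MB" in the rados bench output means 2^20 bytes.

```python
# Figures copied from the write benchmark summary above.
total_writes = 1867
object_bytes = 4194304
run_seconds = 10.503421
concurrency = 40            # from "-t 40"
avg_latency_s = 0.222746

MIB = 2**20  # assumption: rados bench "MB" is 2**20 bytes

# Bandwidth recomputed from the totals.
bw_mb_s = total_writes * object_bytes / MIB / run_seconds

# Little's law: ops in flight / time per op = ops per second.
littles_iops = concurrency / avg_latency_s
littles_mb_s = littles_iops * object_bytes / MIB

print(f"recomputed bandwidth: {bw_mb_s:.1f} MB/s")
print(f"Little's law estimate: {littles_iops:.0f} ops/s, {littles_mb_s:.0f} MB/s")
```

The two numbers agree to within about one percent, which suggests the run was limited by per-operation latency at the given queue depth rather than by a measurement artifact.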



root@polstor02:/home/urzadmin# rados bench -p cephfs_data 10 rand
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) avg lat(s)
    0       0         0         0         0         0 -           0
    1      15      1088      1073   4290.71      4292   0.0137212 0.0139589
    2      15      2191      2176   4351.04      4412   0.0126225 0.0138207
    3      15      3327      3312   4415.12      4544    0.013692 0.0136327
    4      15      4498      4483    4482.1      4684   0.0103933 0.0134332
    5      15      5677      5662   4528.77      4716   0.0115474 0.0132968
    6      15      6836      6821    4546.5      4636   0.0147042 0.0132476
    7      15      7967      7952   4543.19      4524   0.0138084 0.0132329
    8      15      9152      9137   4567.71      4740   0.0150901 0.013193
    9      15     10276     10261   4559.68      4496   0.0126462 0.0132172
   10      15     11424     11409   4562.83      4592   0.0139788 0.0132104
Total time run:       10.020400
Total reads made:     11424
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   4560.3
Average IOPS:         1140
Stddev IOPS:          35
Max IOPS:             1185
Min IOPS:             1073
Average Latency(s):   0.0132159
Max latency(s):       0.316514
Min latency(s):       0.00687372
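
For comparison with the iperf numbers, the read bandwidth above can be converted to Gbit/s. This is a sketch under the assumption that "MB" in the rados bench output means 2^20 bytes. The result is higher than any single iperf stream; one possible reading is that reads are served by OSDs on several hosts in parallel (and possibly by local OSDs), but the output itself does not show this.

```python
# Figures copied from the read benchmark summary above.
total_reads = 11424
object_bytes = 4194304
run_seconds = 10.020400

MIB = 2**20  # assumption: rados bench "MB" is 2**20 bytes

# Bandwidth recomputed from the totals, in MB/s as rados bench prints it.
read_mb_s = total_reads * object_bytes / MIB / run_seconds

# The same rate expressed in decimal Gbit/s, the unit iperf uses.
read_gbit_s = total_reads * object_bytes * 8 / run_seconds / 1e9

# Fastest single iperf stream from the measurements earlier in this mail.
best_iperf_gbit_s = 25.9

print(f"read bandwidth: {read_mb_s:.1f} MB/s = {read_gbit_s:.1f} Gbit/s")
print(f"ratio to best single iperf stream: {read_gbit_s / best_iperf_gbit_s:.2f}")
```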

In terms of native RDMA/IB support: it would be really nice if the Ceph community pushed this feature. There is a big scientific community interested in using Ceph for HPC workloads.

Cheers,

Vadim


On 02/18/2018 04:03 PM, Oliver Freyermuth wrote:
Dear Cephalopodians,

we are just getting started with our first Ceph cluster (Luminous 12.2.2) and 
doing some basic benchmarking.

We have two pools:
- cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) on 
2 hosts (i.e. 2 SSDs each), setup as:
   - replicated, min size 2, max size 4
   - 128 PGs
- cephfs_data,     living on 6 hosts each of which has the following setup:
   - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller 
to which they are attached is in JBOD personality
   - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as 
block-db by the bluestore OSDs living on the HDDs.
   - Created with:
     ceph osd erasure-code-profile set cephfs_data k=4 m=2 crush-device-class=hdd crush-failure-domain=host
     ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
   - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB block-db
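
The summary of the data pool can be turned into a few derived numbers: usable capacity under k=4, m=2, the number of PG shards each OSD carries, and the size of the block-db relative to the data device. A minimal sketch using only the figures listed above; drive and partition sizes are taken as decimal TB and GB, which is an assumption.

```python
# Figures from the pool description above.
k, m = 4, 2
hosts = 6
osds_per_host = 32
osd_tb = 4          # assumption: decimal TB per HDD
pgs = 2048
block_db_gb = 7     # assumption: decimal GB per block-db partition

osds = hosts * osds_per_host
raw_tb = osds * osd_tb

# With erasure coding, k of every k+m chunks hold data.
usable_tb = raw_tb * k / (k + m)

# Each PG places k+m shards, one per OSD in its acting set.
pg_shards_per_osd = pgs * (k + m) / osds

# Size of the block-db partition relative to the data device.
db_ratio = block_db_gb / (osd_tb * 1000)

print(f"{osds} OSDs, {raw_tb} TB raw, {usable_tb:.0f} TB usable")
print(f"PG shards per OSD: {pg_shards_per_osd:.0f}")
print(f"block-db is {db_ratio:.3%} of the data device")
```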

The interconnect (public and cluster network)
is made via IP over Infiniband (56 GBit bandwidth), using the software stack 
that comes with CentOS 7.

This leaves us with the possibility that one of the metadata-hosts can fail, 
and still one of the disks can fail.
For the data hosts, up to two machines total can fail.

We have 40 clients connected to this cluster. We now run something like:
dd if=/dev/zero of=some_file bs=1M count=10000
on each CPU core of each of the clients, yielding a total of 1120 writing 
processes (all 40 clients have 28+28HT cores),
using the ceph-fuse client.

This yields a write throughput of a bit below 1 GB/s (capital B), which is 
unexpectedly low.
Running a BeeGFS on the same cluster before (disks were in RAID 6 in that case) 
yielded throughputs of about 12 GB/s,
but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph 
:-).
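
The scale of the dd test can be worked out from the numbers above: how much data is written in total, what each process gets at the observed rate, and how long the run takes at each throughput. A sketch only; it treats "a bit below 1 GB/s" as exactly 1 GB/s and "about 12 GB/s" as exactly 12 GB/s.

```python
# Figures from the description of the dd test above.
clients = 40
procs_per_client = 28
dd_block_bytes = 2**20      # bs=1M
dd_count = 10000

processes = clients * procs_per_client
total_bytes = processes * dd_block_bytes * dd_count

# Assumption: round the observed throughputs to 1 GB/s and 12 GB/s.
ceph_bytes_s = 1e9
beegfs_bytes_s = 12e9

per_process_mb_s = ceph_bytes_s / processes / 1e6
ceph_minutes = total_bytes / ceph_bytes_s / 60
beegfs_minutes = total_bytes / beegfs_bytes_s / 60

print(f"{processes} processes writing {total_bytes / 1e12:.2f} TB in total")
print(f"share per process at 1 GB/s: {per_process_mb_s:.2f} MB/s")
print(f"run time: {ceph_minutes:.0f} min at 1 GB/s, {beegfs_minutes:.0f} min at 12 GB/s")
```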

I performed some basic tests to try to understand the bottleneck for Ceph:
# rados bench -p cephfs_data 10 write --no-cleanup -t 40
Bandwidth (MB/sec):     695.952
Stddev Bandwidth:       295.223
Max bandwidth (MB/sec): 1088
Min bandwidth (MB/sec): 76
Average IOPS:           173
Stddev IOPS:            73
Max IOPS:               272
Min IOPS:               19
Average Latency(s):     0.220967
Stddev Latency(s):      0.305967
Max latency(s):         2.88931
Min latency(s):         0.0741061

=> This agrees mostly with our basic dd benchmark.

Reading is a bit faster:
# rados bench -p cephfs_data 10 rand
=> Bandwidth (MB/sec):   1108.75

However, the disks are reasonably quick:
# ceph tell osd.0 bench
{
     "bytes_written": 1073741824,
     "blocksize": 4194304,
     "bytes_per_sec": 331850403
}
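
To relate the single-OSD result above to the cluster-wide numbers, it can be scaled by the OSD count and the erasure-coding overhead. This is a deliberately naive upper bound, not a prediction: it assumes every OSD sustains the osd.0 rate at the same time and ignores network, CPU, metadata and any caching effects in the OSD bench itself.

```python
# Figures from the `ceph tell osd.0 bench` output and the pool description.
bytes_written = 1073741824
bytes_per_sec = 331850403
osds = 192
k, m = 4, 2

# Per-OSD rate in MiB/s and the duration of the 1 GiB bench.
osd_mib_s = bytes_per_sec / 2**20
bench_seconds = bytes_written / bytes_per_sec

# Naive aggregate raw write rate if all OSDs ran at this speed at once.
raw_gb_s = osds * bytes_per_sec / 1e9

# Every k bytes of client data become k+m bytes on disk.
client_gb_s = raw_gb_s * k / (k + m)

print(f"osd.0: {osd_mib_s:.0f} MiB/s, bench took {bench_seconds:.1f} s")
print(f"naive aggregate: {raw_gb_s:.1f} GB/s raw, {client_gb_s:.1f} GB/s client-side")
```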

I checked and the OSD-hosts peaked at a load average of about 22 (they have 
24+24HT cores) in our dd benchmark,
but stayed well below that (only about 20 % per OSD daemon) in the rados bench 
test.
One idea would be to switch from jerasure to ISA, since the machines are all 
Intel CPUs only anyways.

Already tried:
- TCP stack tuning (wmem, rmem), no huge effect.
- changing the block sizes used by dd, no effect.
- Testing network throughput with ib_write_bw, this revealed something like:
  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
  2          5000             19.73              19.30              10.118121
  4          5000             52.79              51.70              13.553412
  8          5000             101.23             96.65              12.668371
  16         5000             243.66             233.42             15.297583
  32         5000             350.66             344.73             11.296089
  64         5000             909.14             324.85             5.322323
  128        5000             1424.84            1401.29            11.479374
  256        5000             2865.24            2801.04            11.473055
  512        5000             5169.98            5095.08            10.434733
  1024       5000             10022.75           9791.42            10.026410
  2048       5000             10988.64           10628.83           5.441958
  4096       5000             11401.40           11399.14           2.918180
[...]
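
The columns of the ib_write_bw output above are tied together: the message rate is the average bandwidth divided by the message size. The sketch below checks this for three of the rows, assuming "MB" in the output means 2^20 bytes (the printed rates match under that assumption).

```python
MIB = 2**20  # assumption: ib_write_bw "MB" is 2**20 bytes

# (message size in bytes, BW average in MB/s, MsgRate in Mpps) from the table.
rows = [
    (2, 19.30, 10.118121),
    (64, 324.85, 5.322323),
    (4096, 11399.14, 2.918180),
]

derived_rates = []
for size, bw_avg, reported_mpps in rows:
    # Messages per second = bytes per second / bytes per message.
    derived_mpps = bw_avg * MIB / size / 1e6
    derived_rates.append(derived_mpps)
    print(f"{size:>5} B: derived {derived_mpps:.3f} Mpps, reported {reported_mpps:.3f} Mpps")
```

Small messages are bound by message rate, large ones by bandwidth, so for 4 MB Ceph objects the bandwidth column at the largest sizes is the relevant one.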

So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using 
RDMA).
Other ideas that come to mind:
- Testing with Ceph-RDMA, but that does not seem production-ready yet, if I 
read the list correctly.
- Increasing osd_pool_erasure_code_stripe_width.
- Using ISA as EC plugin.
- Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark is 
ongoing, swap is used (but not when performing benchmarking only,
   so this should not explain the slowdown).

However, since we are just beginning with Ceph, it may well be we are missing 
something basic, but crucial here.
For example, could it be that the block-db storage is too small? How to find 
out?

Do any ideas come to mind?

A second, hopefully easier question:
If one OSD-host fails in our setup, all PGs are changed to 
"active+clean+remapped" and lots of data is moved.
I understand the remapping is needed, but why is data actually moved? With k=4 
and m=2, failure domain=host,
and 6 hosts of which one is down, there should be no advantage for redundancy 
in moving data around after one host has gone down. Or am I missing something here?

Cheers and many thanks in advance,
        Oliver






_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Vadim Bulst

Universität Leipzig / URZ
04109  Leipzig, Augustusplatz 10

phone: +49-341-97-33380
mail:vadim.bu...@uni-leipzig.de
