Hi,

Have you tried running rados bench in parallel from several client machines? That would demonstrate the full BW capacity of the cluster.
e.g. make a test pool with 256 PGs (which will average 16 per OSD on your cluster). Then from several clients at once do `rados bench -p test 60 write`, and at the same time `watch ceph status` to see the total bandwidth. Then you can try different replication or erasure coding settings to learn their impact on performance...
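Concretely, the sequence could look something like this (a rough sketch; the erasure-code profile name and its k/m values are only illustrative):

# create a throwaway replicated test pool with 256 placement groups
ceph osd pool create test 256 256

# run this from each of several client machines at the same time (60 s of writes)
rados bench -p test 60 write

# meanwhile, on any node with an admin keyring, watch the aggregate client throughput
watch ceph status

# optional comparison against erasure coding (profile name and k/m are placeholders)
ceph osd erasure-code-profile set testprofile k=4 m=2
ceph osd pool create testec 256 256 erasure testprofile
rados bench -p testec 60 write

# remove the test pools when finished
ceph osd pool delete test test --yes-i-really-really-mean-it
ceph osd pool delete testec testec --yes-i-really-really-mean-it

If the aggregate number in `ceph status` keeps climbing as you add clients, the cluster still has headroom and the limit you saw was on the client side.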
-- dan

P.S. two mons is never a good idea. Use 3.

PPS. What are those 21.8TB devices?

PPPS. Any reason you are running jewel instead of luminous or mimic?

On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <[email protected]> wrote:
>
> Hi,
>
> To make the problem clearer, here is the configuration of the cluster:
>
> The 'problem' I have is the low bandwidth, no matter how much I increase the concurrency.
> I have tried using MPI to launch 322 processes, each calling librados to create a handle, initialize the io context, and write one 80MB object.
> I only got ~160 MB/sec; with one process I can get ~40 MB/sec. I'm wondering whether the number of client-OSD connections is limited by the number of hosts.
>
> Best,
> Jialin
> NERSC/LBNL
>
> $ ceph osd tree
> ID WEIGHT     TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 1047.59473 root default
> -2  261.89868     host ngfdv036
>  0   21.82489         osd.0          up  1.00000          1.00000
>  4   21.82489         osd.4          up  1.00000          1.00000
>  8   21.82489         osd.8          up  1.00000          1.00000
> 12   21.82489         osd.12         up  1.00000          1.00000
> 16   21.82489         osd.16         up  1.00000          1.00000
> 20   21.82489         osd.20         up  1.00000          1.00000
> 24   21.82489         osd.24         up  1.00000          1.00000
> 28   21.82489         osd.28         up  1.00000          1.00000
> 32   21.82489         osd.32         up  1.00000          1.00000
> 36   21.82489         osd.36         up  1.00000          1.00000
> 40   21.82489         osd.40         up  1.00000          1.00000
> 44   21.82489         osd.44         up  1.00000          1.00000
> -3  261.89868     host ngfdv037
>  1   21.82489         osd.1          up  1.00000          1.00000
>  5   21.82489         osd.5          up  1.00000          1.00000
>  9   21.82489         osd.9          up  1.00000          1.00000
> 13   21.82489         osd.13         up  1.00000          1.00000
> 17   21.82489         osd.17         up  1.00000          1.00000
> 21   21.82489         osd.21         up  1.00000          1.00000
> 25   21.82489         osd.25         up  1.00000          1.00000
> 29   21.82489         osd.29         up  1.00000          1.00000
> 33   21.82489         osd.33         up  1.00000          1.00000
> 37   21.82489         osd.37         up  1.00000          1.00000
> 41   21.82489         osd.41         up  1.00000          1.00000
> 45   21.82489         osd.45         up  1.00000          1.00000
> -4  261.89868     host ngfdv038
>  2   21.82489         osd.2          up  1.00000          1.00000
>  6   21.82489         osd.6          up  1.00000          1.00000
> 10   21.82489         osd.10         up  1.00000          1.00000
> 14   21.82489         osd.14         up  1.00000          1.00000
> 18   21.82489         osd.18         up  1.00000          1.00000
> 22   21.82489         osd.22         up  1.00000          1.00000
> 26   21.82489         osd.26         up  1.00000          1.00000
> 30   21.82489         osd.30         up  1.00000          1.00000
> 34   21.82489         osd.34         up  1.00000          1.00000
> 38   21.82489         osd.38         up  1.00000          1.00000
> 42   21.82489         osd.42         up  1.00000          1.00000
> 46   21.82489         osd.46         up  1.00000          1.00000
> -5  261.89868     host ngfdv039
>  3   21.82489         osd.3          up  1.00000          1.00000
>  7   21.82489         osd.7          up  1.00000          1.00000
> 11   21.82489         osd.11         up  1.00000          1.00000
> 15   21.82489         osd.15         up  1.00000          1.00000
> 19   21.82489         osd.19         up  1.00000          1.00000
> 23   21.82489         osd.23         up  1.00000          1.00000
> 27   21.82489         osd.27         up  1.00000          1.00000
> 31   21.82489         osd.31         up  1.00000          1.00000
> 35   21.82489         osd.35         up  1.00000          1.00000
> 39   21.82489         osd.39         up  1.00000          1.00000
> 43   21.82489         osd.43         up  1.00000          1.00000
> 47   21.82489         osd.47         up  1.00000          1.00000
>
> $ ceph -s
>     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
>      health HEALTH_OK
>      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
>             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
>      osdmap e280: 48 osds: 48 up, 48 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
>             79218 MB used, 1047 TB / 1047 TB avail
>                 3136 active+clean
>
> On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <[email protected]> wrote:
>>
>> Thank you Dan. I’ll try it.
>>
>> Best,
>> Jialin
>> NERSC/LBNL
>>
>> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > One way you can see exactly what is happening when you write an object is with --debug_ms=1.
>> >
>> > For example, I write a 100MB object to a test pool:
>> >     rados --debug_ms=1 -p test put 100M.dat 100M.dat
>> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
>> > In this case, it first gets the cluster maps from a mon, then writes the object to osd.58, which is the primary osd for PG 119.77:
>> >
>> > # ceph pg 119.77 query | jq .up
>> > [
>> >   58,
>> >   49,
>> >   31
>> > ]
>> >
>> > Otherwise I answered your questions below...
>> >
>> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <[email protected]> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I have a couple of questions regarding the IO on OSDs via librados.
>> >>
>> >> 1. How to check which osd is receiving data?
>> >>
>> >
>> > See `ceph osd map`. For my example above:
>> >
>> > # ceph osd map test 100M.dat
>> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77 (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
>> >
>> >> 2. Can the write operation return immediately to the application once the write to the primary OSD is done? Or does it return only when the data is replicated twice? (size=3)
>> >
>> > Write returns once it is safe on *all* replicas or EC chunks.
>> >
>> >> 3. What is the I/O size at the lower level in librados? e.g., if I send a 100MB request with 1 thread, does librados send the data in a fixed transaction size?
>> >
>> > This depends on the client. The `rados` CLI example I showed you broke the 100MB object into 4MB parts. Most use-cases keep the objects around 4MB or 8MB.
>> >
>> >> 4. I have 4 OSS and 48 OSDs; will the 4 OSS become the bottleneck? From the ceph documentation, once the cluster map is received by the client, the client can talk to the OSDs directly, so the assumption is that the max parallelism depends on the number of OSDs. Is this correct?
>> >
>> > That's more or less correct -- the IOPS and BW capacity of the cluster generally scales linearly with the number of OSDs.
>> >
>> > Cheers,
>> > Dan
>> > CERN

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
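As a further illustration of the concurrency points discussed above, `rados bench` also lets you vary the number of in-flight operations and the per-operation object size, which helps separate a per-client limit from a cluster-wide one. This is only a sketch; the flag values here are examples, not recommendations:

# defaults: 16 concurrent ops of 4 MB each, per client
rados bench -p test 60 write

# more in-flight operations from a single client (-t sets the number of concurrent ops)
rados bench -p test 60 write -t 64

# larger objects per operation (-b is in bytes; 64 MB here, closer to the 80MB objects described)
rados bench -p test 60 write -t 16 -b 67108864

# keep the written objects, measure sequential reads from the same client, then clean up
rados bench -p test 60 write --no-cleanup
rados bench -p test 60 seq
rados -p test cleanup

Note that the write bench reports client-visible bandwidth; with size=3 the OSDs are writing roughly three times that amount in aggregate.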
