Hi Dan,

Thanks for the follow-ups. I just tried running multiple librados MPI applications from multiple nodes, and it does show increased bandwidth: with ceph -w I observed as high as 500 MB/sec (previously only 160 MB/sec). I think I can do finer tuning by coordinating more concurrent applications to find the peak. (Sorry, I only have one node with the rados CLI installed, so I can't follow your example to stress the server.)
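For what it's worth, here is the little bit of arithmetic I've been using to sanity-check these runs: comparing the measured aggregate bandwidth against ideal linear scaling from the single-process rate (~40 MB/sec per process in my earlier runs). The helper below is purely illustrative, not part of the benchmark code:

```python
def scaling_efficiency(procs, per_proc_mb_s, aggregate_mb_s):
    """Fraction of ideal linear scaling actually achieved.

    ideal = procs * per_proc_mb_s is what we'd see if every
    client scaled independently; the ratio shows how far off we are.
    """
    ideal = procs * per_proc_mb_s
    return aggregate_mb_s / ideal

# 322 MPI processes, ~40 MB/sec each in isolation, ~160 MB/sec aggregate:
print(scaling_efficiency(322, 40, 160))  # ~0.012, i.e. about 1% of linear
# the multi-node runs reaching ~500 MB/sec:
print(scaling_efficiency(322, 40, 500))  # ~0.039
```

Even at 500 MB/sec the runs are only a few percent of linear scaling, which is why I suspect a shared bottleneck rather than per-client limits.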
> Then you can try different replication or erasure coding settings to
> learn their impact on performance...

Good points.

> PPS. What are those 21.8TB devices?

The storage arrays are Nexsan E60 arrays with two active-active redundant controllers and sixty 3 TB disk drives. The disk drives are organized into six 8+2 RAID 6 LUNs of 24 TB each.

> PPPS. Any reason you are running jewel instead of luminous or mimic?

I have to ask the cluster admin; I'm not sure about it.

I have one more question, regarding the OSD servers and OSDs: I was told that the IO has to go through the 4 OSD servers (hosts) before touching the OSDs. This is confusing to me, as I learned from the ceph documentation
http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
that librados can talk to the OSDs directly. What am I missing here?

Best,
Jialin
NERSC/LBNL

> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <[email protected]> wrote:
> >
> > Hi, To make the problem clearer, here is the configuration of the cluster:
> >
> > The 'problem' I have is the low bandwidth no matter how I increase the concurrency.
> > I have tried using MPI to launch 322 processes, each calling librados to
> > create a handle and initialize the io context, and write one 80MB object.
> > I only got ~160 MB/sec; with one process, I can get ~40 MB/sec. I'm
> > wondering if the number of client-osd connections is limited by the number
> > of hosts.
> >
> > Best,
> > Jialin
> > NERSC/LBNL
> >
> > $ ceph osd tree
> > ID WEIGHT     TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 1047.59473 root default
> > -2  261.89868     host ngfdv036
> >  0   21.82489         osd.0           up  1.00000          1.00000
> >  4   21.82489         osd.4           up  1.00000          1.00000
> >  8   21.82489         osd.8           up  1.00000          1.00000
> > 12   21.82489         osd.12          up  1.00000          1.00000
> > 16   21.82489         osd.16          up  1.00000          1.00000
> > 20   21.82489         osd.20          up  1.00000          1.00000
> > 24   21.82489         osd.24          up  1.00000          1.00000
> > 28   21.82489         osd.28          up  1.00000          1.00000
> > 32   21.82489         osd.32          up  1.00000          1.00000
> > 36   21.82489         osd.36          up  1.00000          1.00000
> > 40   21.82489         osd.40          up  1.00000          1.00000
> > 44   21.82489         osd.44          up  1.00000          1.00000
> > -3  261.89868     host ngfdv037
> >  1   21.82489         osd.1           up  1.00000          1.00000
> >  5   21.82489         osd.5           up  1.00000          1.00000
> >  9   21.82489         osd.9           up  1.00000          1.00000
> > 13   21.82489         osd.13          up  1.00000          1.00000
> > 17   21.82489         osd.17          up  1.00000          1.00000
> > 21   21.82489         osd.21          up  1.00000          1.00000
> > 25   21.82489         osd.25          up  1.00000          1.00000
> > 29   21.82489         osd.29          up  1.00000          1.00000
> > 33   21.82489         osd.33          up  1.00000          1.00000
> > 37   21.82489         osd.37          up  1.00000          1.00000
> > 41   21.82489         osd.41          up  1.00000          1.00000
> > 45   21.82489         osd.45          up  1.00000          1.00000
> > -4  261.89868     host ngfdv038
> >  2   21.82489         osd.2           up  1.00000          1.00000
> >  6   21.82489         osd.6           up  1.00000          1.00000
> > 10   21.82489         osd.10          up  1.00000          1.00000
> > 14   21.82489         osd.14          up  1.00000          1.00000
> > 18   21.82489         osd.18          up  1.00000          1.00000
> > 22   21.82489         osd.22          up  1.00000          1.00000
> > 26   21.82489         osd.26          up  1.00000          1.00000
> > 30   21.82489         osd.30          up  1.00000          1.00000
> > 34   21.82489         osd.34          up  1.00000          1.00000
> > 38   21.82489         osd.38          up  1.00000          1.00000
> > 42   21.82489         osd.42          up  1.00000          1.00000
> > 46   21.82489         osd.46          up  1.00000          1.00000
> > -5  261.89868     host ngfdv039
> >  3   21.82489         osd.3           up  1.00000          1.00000
> >  7   21.82489         osd.7           up  1.00000          1.00000
> > 11   21.82489         osd.11          up  1.00000          1.00000
> > 15   21.82489         osd.15          up  1.00000          1.00000
> > 19   21.82489         osd.19          up  1.00000          1.00000
> > 23   21.82489         osd.23          up  1.00000          1.00000
> > 27   21.82489         osd.27          up  1.00000          1.00000
> > 31   21.82489         osd.31          up  1.00000          1.00000
> > 35   21.82489         osd.35          up  1.00000          1.00000
> > 39   21.82489         osd.39          up  1.00000          1.00000
> > 43   21.82489         osd.43          up  1.00000          1.00000
> > 47   21.82489         osd.47          up  1.00000          1.00000
> >
> > $ ceph -s
> >     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
> >      health HEALTH_OK
> >      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
> >             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
> >      osdmap e280: 48 osds: 48 up, 48 in
> >             flags sortbitwise,require_jewel_osds
> >       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
> >             79218 MB used, 1047 TB / 1047 TB avail
> >                 3136 active+clean
> >
> > On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <[email protected]> wrote:
> >>
> >> Thank you Dan. I'll try it.
> >>
> >> Best,
> >> Jialin
> >> NERSC/LBNL
> >>
> >> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > One way you can see exactly what is happening when you write an
> >> > object is with --debug_ms=1.
> >> >
> >> > For example, I write a 100MB object to a test pool:
> >> >     rados --debug_ms=1 -p test put 100M.dat 100M.dat
> >> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> >> > In this case, it first gets the cluster maps from a mon, then writes
> >> > the object to osd.58, which is the primary osd for PG 119.77:
> >> >
> >> > # ceph pg 119.77 query | jq .up
> >> > [
> >> >   58,
> >> >   49,
> >> >   31
> >> > ]
> >> >
> >> > Otherwise I answered your questions below...
> >> >
> >> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <[email protected]> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> I have a couple questions regarding the IO on OSD via librados.
> >> >>
> >> >> 1. How to check which osd is receiving data?
> >> >
> >> > See `ceph osd map`.
> >> > For my example above:
> >> >
> >> > # ceph osd map test 100M.dat
> >> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> >> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> >> >
> >> >> 2. Can the write operation return immediately to the application
> >> >> once the write to the primary OSD is done? or does it return only
> >> >> when the data is replicated twice? (size=3)
> >> >
> >> > Write returns once it is safe on *all* replicas or EC chunks.
> >> >
> >> >> 3. What is the I/O size in the lower level in librados, e.g., if I
> >> >> send a 100MB request with 1 thread, does librados send the data by
> >> >> a fixed transaction size?
> >> >
> >> > This depends on the client. The `rados` CLI example I showed you broke
> >> > the 100MB object into 4MB parts.
> >> > Most use-cases keep the objects around 4MB or 8MB.
> >> >
> >> >> 4. I have 4 OSS, 48 OSDs, will the 4 OSS become the bottleneck? from
> >> >> the ceph documentation, once the cluster map is received by the
> >> >> client, the client can talk to OSD directly, so the assumption is
> >> >> the max parallelism depends on the number of OSDs, is this correct?
> >> >
> >> > That's more or less correct -- the IOPS and BW capacity of the cluster
> >> > generally scales linearly with number of OSDs.
> >> >
> >> > Cheers,
> >> > Dan
> >> > CERN
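PS. To make sure I understand Dan's point in answer 3 about the rados CLI breaking the 100MB object into 4MB parts: the split into fixed-size operations would look like the sketch below. This is just the offset/length arithmetic, not actual librados calls, and the 4 MB op size is the value from Dan's example rather than something I've verified in the code:

```python
def split_into_ops(total_bytes, op_size=4 * 1024 * 1024):
    """Yield (offset, length) pairs for writing an object in fixed-size ops.

    Every op is op_size bytes except possibly the last, which carries
    whatever remains.
    """
    return [(off, min(op_size, total_bytes - off))
            for off in range(0, total_bytes, op_size)]

ops = split_into_ops(100 * 1024 * 1024)  # the 100MB object from the example
print(len(ops))          # 25 ops, 4 MB each
print(ops[0], ops[-1])   # (0, 4194304) (100663296, 4194304)
```

If that is right, a single-threaded 100MB put is really 25 sequential 4MB writes to the same primary OSD, which would explain why one process tops out well below the cluster's aggregate capability.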
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
