On Tue, Jun 19, 2018 at 1:04 AM Jialin Liu <jaln...@lbl.gov> wrote:
>
> Hi Dan, thanks for the follow-ups.
>
> I have just tried running multiple librados MPI applications from multiple
> nodes, and it does show increased bandwidth: with ceph -w I observed as high
> as 500 MB/sec (previously only 160 MB/sec). I think I can do finer tuning by
> coordinating more concurrent applications to reach the peak. (Sorry, I only
> have one node with the rados CLI installed, so I can't follow your example to
> stress the server.)
>
>> Then you can try different replication or erasure coding settings to
>> learn their impact on performance...
>
> Good points.
>
>> PPS. What are those 21.8TB devices ?
>
> The storage arrays are Nexsan E60 arrays with two active-active redundant
> controllers and 60 3 TB disk drives. The disk drives are organized into six
> 8+2 RAID 6 LUNs of 24 TB each.
>
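> Roughly, each MPI rank in the librados test above does something like the
> sketch below (assuming the Python rados bindings plus mpi4py; the conffile
> path, pool name, and object naming are just placeholders):
>
>     from mpi4py import MPI
>     import rados
>
>     rank = MPI.COMM_WORLD.Get_rank()
>
>     # one cluster handle and io context per rank
>     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>     cluster.connect()
>     ioctx = cluster.open_ioctx('testpool')  # placeholder pool name
>
>     # each rank writes its own ~80 MB object
>     data = b'x' * (80 * 1024 * 1024)
>     ioctx.write_full('mpi-obj-%d' % rank, data)
>
>     ioctx.close()
>     cluster.shutdown()
>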
This is not the ideal Ceph hardware. Ceph is designed to use disks directly
-- JBODs. All redundancy is handled at the RADOS level, so you can happily
save lots of cash on your servers. I suggest reading through the various Ceph
hardware recommendations that you can find via Google. I can't tell from here
if this is the root cause of your performance issue -- but you should plan
future clusters to use JBODs instead of expensive arrays.

>
>> PPPS. Any reason you are running jewel instead of luminous or mimic?
>
> I have to ask the cluster admin, I'm not sure about it.
>
> I have one more question regarding the OSD servers and OSDs: I was told that
> the IO has to go through the 4 OSD servers (hosts) before touching the OSDs.
> This is confusing to me, as I learned from the ceph documentation
> http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
> that librados can talk to the OSDs directly. What am I missing here?

You should have one ceph-osd process per disk (or per LUN in your case).
The clients connect to the ceph-osd processes directly.

-- dan

>
> Best,
> Jialin
> NERSC/LBNL
>
>> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <jaln...@lbl.gov> wrote:
>> >
>> > Hi, To make the problem clearer, here is the configuration of the cluster:
>> >
>> > The 'problem' I have is the low bandwidth, no matter how much I increase
>> > the concurrency. I have tried using MPI to launch 322 processes, each
>> > calling librados to create a handle and initialize the io context, and
>> > write one 80MB object. I only got ~160 MB/sec; with one process, I can
>> > get ~40 MB/sec. I'm wondering if the number of client-OSD connections is
>> > limited by the number of hosts.
>> >
>> > Best,
>> > Jialin
>> > NERSC/LBNL
>> >
>> > $ ceph osd tree
>> >
>> > ID WEIGHT     TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> > -1 1047.59473 root default
>> > -2  261.89868     host ngfdv036
>> >  0   21.82489         osd.0           up  1.00000          1.00000
>> >  4   21.82489         osd.4           up  1.00000          1.00000
>> >  8   21.82489         osd.8           up  1.00000          1.00000
>> > 12   21.82489         osd.12          up  1.00000          1.00000
>> > 16   21.82489         osd.16          up  1.00000          1.00000
>> > 20   21.82489         osd.20          up  1.00000          1.00000
>> > 24   21.82489         osd.24          up  1.00000          1.00000
>> > 28   21.82489         osd.28          up  1.00000          1.00000
>> > 32   21.82489         osd.32          up  1.00000          1.00000
>> > 36   21.82489         osd.36          up  1.00000          1.00000
>> > 40   21.82489         osd.40          up  1.00000          1.00000
>> > 44   21.82489         osd.44          up  1.00000          1.00000
>> > -3  261.89868     host ngfdv037
>> >  1   21.82489         osd.1           up  1.00000          1.00000
>> >  5   21.82489         osd.5           up  1.00000          1.00000
>> >  9   21.82489         osd.9           up  1.00000          1.00000
>> > 13   21.82489         osd.13          up  1.00000          1.00000
>> > 17   21.82489         osd.17          up  1.00000          1.00000
>> > 21   21.82489         osd.21          up  1.00000          1.00000
>> > 25   21.82489         osd.25          up  1.00000          1.00000
>> > 29   21.82489         osd.29          up  1.00000          1.00000
>> > 33   21.82489         osd.33          up  1.00000          1.00000
>> > 37   21.82489         osd.37          up  1.00000          1.00000
>> > 41   21.82489         osd.41          up  1.00000          1.00000
>> > 45   21.82489         osd.45          up  1.00000          1.00000
>> > -4  261.89868     host ngfdv038
>> >  2   21.82489         osd.2           up  1.00000          1.00000
>> >  6   21.82489         osd.6           up  1.00000          1.00000
>> > 10   21.82489         osd.10          up  1.00000          1.00000
>> > 14   21.82489         osd.14          up  1.00000          1.00000
>> > 18   21.82489         osd.18          up  1.00000          1.00000
>> > 22   21.82489         osd.22          up  1.00000          1.00000
>> > 26   21.82489         osd.26          up  1.00000          1.00000
>> > 30   21.82489         osd.30          up  1.00000          1.00000
>> > 34   21.82489         osd.34          up  1.00000          1.00000
>> > 38   21.82489         osd.38          up  1.00000          1.00000
>> > 42   21.82489         osd.42          up  1.00000          1.00000
>> > 46   21.82489         osd.46          up  1.00000          1.00000
>> > -5  261.89868     host ngfdv039
>> >  3   21.82489         osd.3           up  1.00000          1.00000
>> >  7   21.82489         osd.7           up  1.00000          1.00000
>> > 11   21.82489         osd.11          up  1.00000          1.00000
>> > 15   21.82489         osd.15          up  1.00000          1.00000
>> > 19   21.82489         osd.19          up  1.00000          1.00000
>> > 23   21.82489         osd.23          up  1.00000          1.00000
>> > 27   21.82489         osd.27          up  1.00000          1.00000
>> > 31   21.82489         osd.31          up  1.00000          1.00000
>> > 35   21.82489         osd.35          up  1.00000          1.00000
>> > 39   21.82489         osd.39          up  1.00000          1.00000
>> > 43   21.82489         osd.43          up  1.00000          1.00000
>> > 47   21.82489         osd.47          up  1.00000          1.00000
>> >
>> > $ ceph -s
>> >
>> >     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
>> >      health HEALTH_OK
>> >      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
>> >             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
>> >      osdmap e280: 48 osds: 48 up, 48 in
>> >             flags sortbitwise,require_jewel_osds
>> >       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
>> >             79218 MB used, 1047 TB / 1047 TB avail
>> >                 3136 active+clean
>> >
>> > On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <jaln...@lbl.gov> wrote:
>> >>
>> >> Thank you Dan. I'll try it.
>> >>
>> >> Best,
>> >> Jialin
>> >> NERSC/LBNL
>> >>
>> >> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <d...@vanderster.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > One way you can see exactly what is happening when you write an object
>> >> > is with --debug_ms=1.
>> >> >
>> >> > For example, I write a 100MB object to a test pool:
>> >> >     rados --debug_ms=1 -p test put 100M.dat 100M.dat
>> >> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
>> >> > In this case, it first gets the cluster maps from a mon, then writes
>> >> > the object to osd.58, which is the primary osd for PG 119.77:
>> >> >
>> >> > # ceph pg 119.77 query | jq .up
>> >> > [
>> >> >   58,
>> >> >   49,
>> >> >   31
>> >> > ]
>> >> >
>> >> > Otherwise I answered your questions below...
>> >> >
>> >> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <jaln...@lbl.gov> wrote:
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> I have a couple of questions regarding the IO on OSDs via librados.
>> >> >>
>> >> >> 1. How to check which osd is receiving data?
>> >> >>
>> >> >
>> >> > See `ceph osd map`.
>> >> > For my example above:
>> >> >
>> >> > # ceph osd map test 100M.dat
>> >> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
>> >> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
>> >> >
>> >> >> 2. Can the write operation return immediately to the application once
>> >> >> the write to the primary OSD is done? Or does it return only when the
>> >> >> data is replicated twice? (size=3)
>> >> >
>> >> > Write returns once it is safe on *all* replicas or EC chunks.
>> >> >
>> >> >> 3. What is the I/O size in the lower level in librados, e.g., if I
>> >> >> send a 100MB request with 1 thread, does librados send the data in a
>> >> >> fixed transaction size?
>> >> >
>> >> > This depends on the client. The `rados` CLI example I showed you broke
>> >> > the 100MB object into 4MB parts.
>> >> > Most use-cases keep the objects around 4MB or 8MB.
>> >> >
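>> >> > For example, a client that wants more parallelism out of librados can
>> >> > split a 100 MB payload into ~8 MB objects and queue them asynchronously,
>> >> > along these lines (a rough sketch with the Python rados bindings; the
>> >> > conffile path, pool name and object names are placeholders):
>> >> >
>> >> >     import rados
>> >> >
>> >> >     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>> >> >     cluster.connect()
>> >> >     ioctx = cluster.open_ioctx('test')  # placeholder pool name
>> >> >
>> >> >     payload = b'x' * (100 * 1024 * 1024)
>> >> >     chunk = 8 * 1024 * 1024
>> >> >
>> >> >     # queue each piece asynchronously so the pieces can go to
>> >> >     # different PGs/OSDs in parallel
>> >> >     completions = []
>> >> >     for i in range(0, len(payload), chunk):
>> >> >         c = ioctx.aio_write_full('100M.dat.%d' % (i // chunk),
>> >> >                                  payload[i:i + chunk])
>> >> >         completions.append(c)
>> >> >
>> >> >     # each completion returns once that piece is safe on all replicas
>> >> >     for c in completions:
>> >> >         c.wait_for_safe()
>> >> >
>> >> >     ioctx.close()
>> >> >     cluster.shutdown()
>> >> >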
>> >> >> 4. I have 4 OSS, 48 OSDs; will the 4 OSS become the bottleneck? From
>> >> >> the ceph documentation, once the cluster map is received by the
>> >> >> client, the client can talk to the OSDs directly, so the assumption is
>> >> >> that the max parallelism depends on the number of OSDs. Is this
>> >> >> correct?
>> >> >>
>> >> >
>> >> > That's more or less correct -- the IOPS and BW capacity of the cluster
>> >> > generally scales linearly with the number of OSDs.
>> >> >
>> >> > Cheers,
>> >> > Dan
>> >> > CERN

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com