Hi Dan,

Thanks for the follow-ups. I just tried running multiple librados MPI applications from multiple nodes, and it does show increased bandwidth: with ceph -w I observed as high as 500 MB/sec (previously only 160 MB/sec). I think I can do finer tuning by coordinating more concurrent applications to find the peak. (Sorry, I only have one node with the rados CLI installed, so I can't follow your example to stress the server.)
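For what it's worth, here is the little bit of arithmetic I've been using to sanity-check these runs: comparing the measured aggregate bandwidth against ideal linear scaling from the single-process rate (~40 MB/sec per process in my earlier runs). The helper below is purely illustrative, not part of the benchmark code:

```python
def scaling_efficiency(procs, per_proc_mb_s, aggregate_mb_s):
    """Fraction of ideal linear scaling actually achieved.

    ideal = procs * per_proc_mb_s is what we'd see if every
    client scaled independently; the ratio shows how far off we are.
    """
    ideal = procs * per_proc_mb_s
    return aggregate_mb_s / ideal

# 322 MPI processes, ~40 MB/sec each in isolation, ~160 MB/sec aggregate:
print(scaling_efficiency(322, 40, 160))  # ~0.012, i.e. about 1% of linear
# the multi-node runs reaching ~500 MB/sec:
print(scaling_efficiency(322, 40, 500))  # ~0.039
```

Even at 500 MB/sec the runs are only a few percent of linear scaling, which is why I suspect a shared bottleneck rather than per-client limits.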
> Then you can try different replication or erasure coding settings to
> learn their impact on performance...

Good points.

> PPS. What are those 21.8TB devices?

The storage arrays are Nexsan E60 arrays with two active-active redundant controllers and sixty 3 TB disk drives. The disk drives are organized into six 8+2 RAID 6 LUNs of 24 TB each.

> PPPS. Any reason you are running jewel instead of luminous or mimic?

I have to ask the cluster admin; I'm not sure about it.

I have one more question, regarding the OSD servers and OSDs: I was told that the IO has to go through the 4 OSD servers (hosts) before touching the OSDs. This is confusing to me, as I learned from the ceph documentation
http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
that librados can talk to the OSDs directly. What am I missing here?

Best,
Jialin
NERSC/LBNL

> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <[email protected]> wrote:
> >
> > Hi, To make the problem clearer, here is the configuration of the cluster:
> >
> > The 'problem' I have is the low bandwidth no matter how I increase the concurrency.
> > I have tried using MPI to launch 322 processes, each calling librados to
> > create a handle and initialize the io context, and write one 80MB object.
> > I only got ~160 MB/sec; with one process, I can get ~40 MB/sec. I'm
> > wondering if the number of client-osd connections is limited by the number
> > of hosts.
> >
> > Best,
> > Jialin
> > NERSC/LBNL
> >
> > $ ceph osd tree
> > ID WEIGHT     TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 1047.59473 root default
> > -2  261.89868     host ngfdv036
> >  0   21.82489         osd.0           up  1.00000          1.00000
> >  4   21.82489         osd.4           up  1.00000          1.00000
> >  8   21.82489         osd.8           up  1.00000          1.00000
> > 12   21.82489         osd.12          up  1.00000          1.00000
> > 16   21.82489         osd.16          up  1.00000          1.00000
> > 20   21.82489         osd.20          up  1.00000          1.00000
> > 24   21.82489         osd.24          up  1.00000          1.00000
> > 28   21.82489         osd.28          up  1.00000          1.00000
> > 32   21.82489         osd.32          up  1.00000          1.00000
> > 36   21.82489         osd.36          up  1.00000          1.00000
> > 40   21.82489         osd.40          up  1.00000          1.00000
> > 44   21.82489         osd.44          up  1.00000          1.00000
> > -3  261.89868     host ngfdv037
> >  1   21.82489         osd.1           up  1.00000          1.00000
> >  5   21.82489         osd.5           up  1.00000          1.00000
> >  9   21.82489         osd.9           up  1.00000          1.00000
> > 13   21.82489         osd.13          up  1.00000          1.00000
> > 17   21.82489         osd.17          up  1.00000          1.00000
> > 21   21.82489         osd.21          up  1.00000          1.00000
> > 25   21.82489         osd.25          up  1.00000          1.00000
> > 29   21.82489         osd.29          up  1.00000          1.00000
> > 33   21.82489         osd.33          up  1.00000          1.00000
> > 37   21.82489         osd.37          up  1.00000          1.00000
> > 41   21.82489         osd.41          up  1.00000          1.00000
> > 45   21.82489         osd.45          up  1.00000          1.00000
> > -4  261.89868     host ngfdv038
> >  2   21.82489         osd.2           up  1.00000          1.00000
> >  6   21.82489         osd.6           up  1.00000          1.00000
> > 10   21.82489         osd.10          up  1.00000          1.00000
> > 14   21.82489         osd.14          up  1.00000          1.00000
> > 18   21.82489         osd.18          up  1.00000          1.00000
> > 22   21.82489         osd.22          up  1.00000          1.00000
> > 26   21.82489         osd.26          up  1.00000          1.00000
> > 30   21.82489         osd.30          up  1.00000          1.00000
> > 34   21.82489         osd.34          up  1.00000          1.00000
> > 38   21.82489         osd.38          up  1.00000          1.00000
> > 42   21.82489         osd.42          up  1.00000          1.00000
> > 46   21.82489         osd.46          up  1.00000          1.00000
> > -5  261.89868     host ngfdv039
> >  3   21.82489         osd.3           up  1.00000          1.00000
> >  7   21.82489         osd.7           up  1.00000          1.00000
> > 11   21.82489         osd.11          up  1.00000          1.00000
> > 15   21.82489         osd.15          up  1.00000          1.00000
> > 19   21.82489         osd.19          up  1.00000          1.00000
> > 23   21.82489         osd.23          up  1.00000          1.00000
> > 27   21.82489         osd.27          up  1.00000          1.00000
> > 31   21.82489         osd.31          up  1.00000          1.00000
> > 35   21.82489         osd.35          up  1.00000          1.00000
> > 39   21.82489         osd.39          up  1.00000          1.00000
> > 43   21.82489         osd.43          up  1.00000          1.00000
> > 47   21.82489         osd.47          up  1.00000          1.00000
> >
> > $ ceph -s
> >     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
> >      health HEALTH_OK
> >      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
> >             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
> >      osdmap e280: 48 osds: 48 up, 48 in
> >             flags sortbitwise,require_jewel_osds
> >       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
> >             79218 MB used, 1047 TB / 1047 TB avail
> >                 3136 active+clean
> >
> > On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <[email protected]> wrote:
> >>
> >> Thank you Dan. I'll try it.
> >>
> >> Best,
> >> Jialin
> >> NERSC/LBNL
> >>
> >> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > One way you can see exactly what is happening when you write an
> >> > object is with --debug_ms=1.
> >> >
> >> > For example, I write a 100MB object to a test pool:
> >> >     rados --debug_ms=1 -p test put 100M.dat 100M.dat
> >> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> >> > In this case, it first gets the cluster maps from a mon, then writes
> >> > the object to osd.58, which is the primary osd for PG 119.77:
> >> >
> >> > # ceph pg 119.77 query | jq .up
> >> > [
> >> >   58,
> >> >   49,
> >> >   31
> >> > ]
> >> >
> >> > Otherwise I answered your questions below...
> >> >
> >> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <[email protected]> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> I have a couple questions regarding the IO on OSD via librados.
> >> >>
> >> >> 1. How to check which osd is receiving data?
> >> >
> >> > See `ceph osd map`.
> >> > For my example above:
> >> >
> >> > # ceph osd map test 100M.dat
> >> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> >> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> >> >
> >> >> 2. Can the write operation return immediately to the application
> >> >> once the write to the primary OSD is done? or does it return only
> >> >> when the data is replicated twice? (size=3)
> >> >
> >> > Write returns once it is safe on *all* replicas or EC chunks.
> >> >
> >> >> 3. What is the I/O size in the lower level in librados, e.g., if I
> >> >> send a 100MB request with 1 thread, does librados send the data by
> >> >> a fixed transaction size?
> >> >
> >> > This depends on the client. The `rados` CLI example I showed you broke
> >> > the 100MB object into 4MB parts.
> >> > Most use-cases keep the objects around 4MB or 8MB.
> >> >
> >> >> 4. I have 4 OSS, 48 OSDs, will the 4 OSS become the bottleneck? from
> >> >> the ceph documentation, once the cluster map is received by the
> >> >> client, the client can talk to OSD directly, so the assumption is
> >> >> the max parallelism depends on the number of OSDs, is this correct?
> >> >
> >> > That's more or less correct -- the IOPS and BW capacity of the cluster
> >> > generally scales linearly with number of OSDs.
> >> >
> >> > Cheers,
> >> > Dan
> >> > CERN
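PS. To make sure I understand Dan's point in answer 3 about the rados CLI breaking the 100MB object into 4MB parts: the split into fixed-size operations would look like the sketch below. This is just the offset/length arithmetic, not actual librados calls, and the 4 MB op size is the value from Dan's example rather than something I've verified in the code:

```python
def split_into_ops(total_bytes, op_size=4 * 1024 * 1024):
    """Yield (offset, length) pairs for writing an object in fixed-size ops.

    Every op is op_size bytes except possibly the last, which carries
    whatever remains.
    """
    return [(off, min(op_size, total_bytes - off))
            for off in range(0, total_bytes, op_size)]

ops = split_into_ops(100 * 1024 * 1024)  # the 100MB object from the example
print(len(ops))          # 25 ops, 4 MB each
print(ops[0], ops[-1])   # (0, 4194304) (100663296, 4194304)
```

If that is right, a single-threaded 100MB put is really 25 sequential 4MB writes to the same primary OSD, which would explain why one process tops out well below the cluster's aggregate capability.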
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
