On Tue, Jun 19, 2018 at 1:04 AM Jialin Liu <jaln...@lbl.gov> wrote:
>
> Hi Dan, thanks for the follow-ups.
>
> I have just tried running multiple librados MPI applications from multiple
> nodes, and it does show increased bandwidth: with ceph -w I observed as high
> as 500 MB/sec (previously only 160 MB/sec). I think I can do finer tuning by
> coordinating more concurrent applications to reach the peak. (Sorry, I only
> have one node with the rados CLI installed, so I can't follow your example to
> stress the server.)
>
>> Then you can try different replication or erasure coding settings to
>> learn their impact on performance...
>
> Good points.
>
>> PPS. What are those 21.8TB devices ?
>
> The storage arrays are Nexsan E60 arrays with two active-active redundant
> controllers and 60 3 TB disk drives. The disk drives are organized into six
> 8+2 RAID 6 LUNs of 24 TB each.
>
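> Roughly, each MPI rank in the librados test above does something like the
> sketch below (assuming the Python rados bindings plus mpi4py; the conffile
> path, pool name, and object naming are just placeholders):
>
>     from mpi4py import MPI
>     import rados
>
>     rank = MPI.COMM_WORLD.Get_rank()
>
>     # one cluster handle and io context per rank
>     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>     cluster.connect()
>     ioctx = cluster.open_ioctx('testpool')  # placeholder pool name
>
>     # each rank writes its own ~80 MB object
>     data = b'x' * (80 * 1024 * 1024)
>     ioctx.write_full('mpi-obj-%d' % rank, data)
>
>     ioctx.close()
>     cluster.shutdown()
>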
This is not the ideal Ceph hardware. Ceph is designed to use disks directly
-- JBODs. All redundancy is handled at the RADOS level, so you can happily
save lots of cash on your servers. I suggest reading through the various Ceph
hardware recommendations that you can find via Google. I can't tell from here
if this is the root cause of your performance issue -- but you should plan
future clusters to use JBODs instead of expensive arrays.

>
>> PPPS. Any reason you are running jewel instead of luminous or mimic?
>
> I have to ask the cluster admin, I'm not sure about it.
>
> I have one more question regarding the OSD servers and OSDs: I was told that
> the IO has to go through the 4 OSD servers (hosts) before touching the OSDs.
> This is confusing to me, as I learned from the ceph documentation
> http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
> that librados can talk to the OSDs directly. What am I missing here?

You should have one ceph-osd process per disk (or per LUN in your case).
The clients connect to the ceph-osd processes directly.

-- dan

>
> Best,
> Jialin
> NERSC/LBNL
>
>> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <jaln...@lbl.gov> wrote:
>> >
>> > Hi, To make the problem clearer, here is the configuration of the cluster:
>> >
>> > The 'problem' I have is the low bandwidth, no matter how much I increase
>> > the concurrency. I have tried using MPI to launch 322 processes, each
>> > calling librados to create a handle and initialize the io context, and
>> > write one 80MB object. I only got ~160 MB/sec; with one process, I can
>> > get ~40 MB/sec. I'm wondering if the number of client-OSD connections is
>> > limited by the number of hosts.
>> >
>> > Best,
>> > Jialin
>> > NERSC/LBNL
>> >
>> > $ ceph osd tree
>> >
>> > ID WEIGHT     TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> > -1 1047.59473 root default
>> > -2  261.89868     host ngfdv036
>> >  0   21.82489         osd.0           up  1.00000          1.00000
>> >  4   21.82489         osd.4           up  1.00000          1.00000
>> >  8   21.82489         osd.8           up  1.00000          1.00000
>> > 12   21.82489         osd.12          up  1.00000          1.00000
>> > 16   21.82489         osd.16          up  1.00000          1.00000
>> > 20   21.82489         osd.20          up  1.00000          1.00000
>> > 24   21.82489         osd.24          up  1.00000          1.00000
>> > 28   21.82489         osd.28          up  1.00000          1.00000
>> > 32   21.82489         osd.32          up  1.00000          1.00000
>> > 36   21.82489         osd.36          up  1.00000          1.00000
>> > 40   21.82489         osd.40          up  1.00000          1.00000
>> > 44   21.82489         osd.44          up  1.00000          1.00000
>> > -3  261.89868     host ngfdv037
>> >  1   21.82489         osd.1           up  1.00000          1.00000
>> >  5   21.82489         osd.5           up  1.00000          1.00000
>> >  9   21.82489         osd.9           up  1.00000          1.00000
>> > 13   21.82489         osd.13          up  1.00000          1.00000
>> > 17   21.82489         osd.17          up  1.00000          1.00000
>> > 21   21.82489         osd.21          up  1.00000          1.00000
>> > 25   21.82489         osd.25          up  1.00000          1.00000
>> > 29   21.82489         osd.29          up  1.00000          1.00000
>> > 33   21.82489         osd.33          up  1.00000          1.00000
>> > 37   21.82489         osd.37          up  1.00000          1.00000
>> > 41   21.82489         osd.41          up  1.00000          1.00000
>> > 45   21.82489         osd.45          up  1.00000          1.00000
>> > -4  261.89868     host ngfdv038
>> >  2   21.82489         osd.2           up  1.00000          1.00000
>> >  6   21.82489         osd.6           up  1.00000          1.00000
>> > 10   21.82489         osd.10          up  1.00000          1.00000
>> > 14   21.82489         osd.14          up  1.00000          1.00000
>> > 18   21.82489         osd.18          up  1.00000          1.00000
>> > 22   21.82489         osd.22          up  1.00000          1.00000
>> > 26   21.82489         osd.26          up  1.00000          1.00000
>> > 30   21.82489         osd.30          up  1.00000          1.00000
>> > 34   21.82489         osd.34          up  1.00000          1.00000
>> > 38   21.82489         osd.38          up  1.00000          1.00000
>> > 42   21.82489         osd.42          up  1.00000          1.00000
>> > 46   21.82489         osd.46          up  1.00000          1.00000
>> > -5  261.89868     host ngfdv039
>> >  3   21.82489         osd.3           up  1.00000          1.00000
>> >  7   21.82489         osd.7           up  1.00000          1.00000
>> > 11   21.82489         osd.11          up  1.00000          1.00000
>> > 15   21.82489         osd.15          up  1.00000          1.00000
>> > 19   21.82489         osd.19          up  1.00000          1.00000
>> > 23   21.82489         osd.23          up  1.00000          1.00000
>> > 27   21.82489         osd.27          up  1.00000          1.00000
>> > 31   21.82489         osd.31          up  1.00000          1.00000
>> > 35   21.82489         osd.35          up  1.00000          1.00000
>> > 39   21.82489         osd.39          up  1.00000          1.00000
>> > 43   21.82489         osd.43          up  1.00000          1.00000
>> > 47   21.82489         osd.47          up  1.00000          1.00000
>> >
>> > $ ceph -s
>> >
>> >     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
>> >      health HEALTH_OK
>> >      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
>> >             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
>> >      osdmap e280: 48 osds: 48 up, 48 in
>> >             flags sortbitwise,require_jewel_osds
>> >       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
>> >             79218 MB used, 1047 TB / 1047 TB avail
>> >                 3136 active+clean
>> >
>> > On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <jaln...@lbl.gov> wrote:
>> >>
>> >> Thank you Dan. I'll try it.
>> >>
>> >> Best,
>> >> Jialin
>> >> NERSC/LBNL
>> >>
>> >> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <d...@vanderster.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > One way you can see exactly what is happening when you write an object
>> >> > is with --debug_ms=1.
>> >> >
>> >> > For example, I write a 100MB object to a test pool:
>> >> >     rados --debug_ms=1 -p test put 100M.dat 100M.dat
>> >> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
>> >> > In this case, it first gets the cluster maps from a mon, then writes
>> >> > the object to osd.58, which is the primary osd for PG 119.77:
>> >> >
>> >> > # ceph pg 119.77 query | jq .up
>> >> > [
>> >> >   58,
>> >> >   49,
>> >> >   31
>> >> > ]
>> >> >
>> >> > Otherwise I answered your questions below...
>> >> >
>> >> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <jaln...@lbl.gov> wrote:
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> I have a couple of questions regarding the IO on OSDs via librados.
>> >> >>
>> >> >> 1. How to check which osd is receiving data?
>> >> >>
>> >> >
>> >> > See `ceph osd map`.
>> >> > For my example above:
>> >> >
>> >> > # ceph osd map test 100M.dat
>> >> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
>> >> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
>> >> >
>> >> >> 2. Can the write operation return immediately to the application once
>> >> >> the write to the primary OSD is done? Or does it return only when the
>> >> >> data is replicated twice? (size=3)
>> >> >
>> >> > Write returns once it is safe on *all* replicas or EC chunks.
>> >> >
>> >> >> 3. What is the I/O size in the lower level in librados, e.g., if I
>> >> >> send a 100MB request with 1 thread, does librados send the data in a
>> >> >> fixed transaction size?
>> >> >
>> >> > This depends on the client. The `rados` CLI example I showed you broke
>> >> > the 100MB object into 4MB parts.
>> >> > Most use-cases keep the objects around 4MB or 8MB.
>> >> >
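>> >> > For example, a client that wants more parallelism out of librados can
>> >> > split a 100 MB payload into ~8 MB objects and queue them asynchronously,
>> >> > along these lines (a rough sketch with the Python rados bindings; the
>> >> > conffile path, pool name and object names are placeholders):
>> >> >
>> >> >     import rados
>> >> >
>> >> >     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>> >> >     cluster.connect()
>> >> >     ioctx = cluster.open_ioctx('test')  # placeholder pool name
>> >> >
>> >> >     payload = b'x' * (100 * 1024 * 1024)
>> >> >     chunk = 8 * 1024 * 1024
>> >> >
>> >> >     # queue each piece asynchronously so the pieces can go to
>> >> >     # different PGs/OSDs in parallel
>> >> >     completions = []
>> >> >     for i in range(0, len(payload), chunk):
>> >> >         c = ioctx.aio_write_full('100M.dat.%d' % (i // chunk),
>> >> >                                  payload[i:i + chunk])
>> >> >         completions.append(c)
>> >> >
>> >> >     # each completion returns once that piece is safe on all replicas
>> >> >     for c in completions:
>> >> >         c.wait_for_safe()
>> >> >
>> >> >     ioctx.close()
>> >> >     cluster.shutdown()
>> >> >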
>> >> >> 4. I have 4 OSS, 48 OSDs; will the 4 OSS become the bottleneck? From
>> >> >> the ceph documentation, once the cluster map is received by the
>> >> >> client, the client can talk to the OSDs directly, so the assumption is
>> >> >> that the max parallelism depends on the number of OSDs. Is this
>> >> >> correct?
>> >> >>
>> >> >
>> >> > That's more or less correct -- the IOPS and BW capacity of the cluster
>> >> > generally scales linearly with the number of OSDs.
>> >> >
>> >> > Cheers,
>> >> > Dan
>> >> > CERN

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com