Hi,

Have you tried running rados bench in parallel from several client
machines? That would demonstrate the full BW capacity of the cluster.

E.g. make a test pool with 256 PGs (which, at 3 replicas, averages out
to 16 PGs per OSD on your cluster).
Then from several clients at once do `rados bench -p test 60 write`.
And at the same time run `watch ceph status` to see the total bandwidth.
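
Roughly, something like this (the pool name, PG count, and run length
are just examples; adjust for your cluster and release):

  # on an admin node: create a dedicated benchmark pool
  ceph osd pool create test 256 256

  # on each client node, in parallel, write for 60 seconds
  rados bench -p test 60 write --no-cleanup

  # when finished, remove the benchmark objects
  rados -p test cleanup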

Then you can try different replication or erasure coding settings to
learn their impact on performance...
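
For example (a rough sketch; the pool/profile names and the k=2, m=1
values are just placeholders):

  # compare 2x vs 3x replication on the test pool
  ceph osd pool set test size 2

  # or create an erasure-coded pool and benchmark that instead
  ceph osd erasure-code-profile set testprofile k=2 m=1
  ceph osd pool create testec 256 256 erasure testprofile
  rados bench -p testec 60 write --no-cleanup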

-- dan

P.S. Two mons is never a good idea. Use 3.

PPS. What are those 21.8TB devices?

PPPS. Any reason you are running jewel instead of luminous or mimic?

On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <[email protected]> wrote:
>
> Hi, to make the problem clearer, here is the configuration of the cluster:
>
> The 'problem' I have is the low bandwidth, no matter how much I increase the
> concurrency.
> I have tried using MPI to launch 322 processes, each calling librados to
> create a handle, initialize the io context, and write one 80MB object.
> I only got ~160 MB/sec; with one process I can get ~40 MB/sec. I'm wondering
> whether the number of client-OSD connections is limited by the number of hosts.
>
> Best,
> Jialin
> NERSC/LBNL
>
> $ ceph osd tree
>
> ID WEIGHT     TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 1047.59473 root default
> -2  261.89868     host ngfdv036
>  0   21.82489         osd.0          up  1.00000          1.00000
>  4   21.82489         osd.4          up  1.00000          1.00000
>  8   21.82489         osd.8          up  1.00000          1.00000
> 12   21.82489         osd.12         up  1.00000          1.00000
> 16   21.82489         osd.16         up  1.00000          1.00000
> 20   21.82489         osd.20         up  1.00000          1.00000
> 24   21.82489         osd.24         up  1.00000          1.00000
> 28   21.82489         osd.28         up  1.00000          1.00000
> 32   21.82489         osd.32         up  1.00000          1.00000
> 36   21.82489         osd.36         up  1.00000          1.00000
> 40   21.82489         osd.40         up  1.00000          1.00000
> 44   21.82489         osd.44         up  1.00000          1.00000
> -3  261.89868     host ngfdv037
>  1   21.82489         osd.1          up  1.00000          1.00000
>  5   21.82489         osd.5          up  1.00000          1.00000
>  9   21.82489         osd.9          up  1.00000          1.00000
> 13   21.82489         osd.13         up  1.00000          1.00000
> 17   21.82489         osd.17         up  1.00000          1.00000
> 21   21.82489         osd.21         up  1.00000          1.00000
> 25   21.82489         osd.25         up  1.00000          1.00000
> 29   21.82489         osd.29         up  1.00000          1.00000
> 33   21.82489         osd.33         up  1.00000          1.00000
> 37   21.82489         osd.37         up  1.00000          1.00000
> 41   21.82489         osd.41         up  1.00000          1.00000
> 45   21.82489         osd.45         up  1.00000          1.00000
> -4  261.89868     host ngfdv038
>  2   21.82489         osd.2          up  1.00000          1.00000
>  6   21.82489         osd.6          up  1.00000          1.00000
> 10   21.82489         osd.10         up  1.00000          1.00000
> 14   21.82489         osd.14         up  1.00000          1.00000
> 18   21.82489         osd.18         up  1.00000          1.00000
> 22   21.82489         osd.22         up  1.00000          1.00000
> 26   21.82489         osd.26         up  1.00000          1.00000
> 30   21.82489         osd.30         up  1.00000          1.00000
> 34   21.82489         osd.34         up  1.00000          1.00000
> 38   21.82489         osd.38         up  1.00000          1.00000
> 42   21.82489         osd.42         up  1.00000          1.00000
> 46   21.82489         osd.46         up  1.00000          1.00000
> -5  261.89868     host ngfdv039
>  3   21.82489         osd.3          up  1.00000          1.00000
>  7   21.82489         osd.7          up  1.00000          1.00000
> 11   21.82489         osd.11         up  1.00000          1.00000
> 15   21.82489         osd.15         up  1.00000          1.00000
> 19   21.82489         osd.19         up  1.00000          1.00000
> 23   21.82489         osd.23         up  1.00000          1.00000
> 27   21.82489         osd.27         up  1.00000          1.00000
> 31   21.82489         osd.31         up  1.00000          1.00000
> 35   21.82489         osd.35         up  1.00000          1.00000
> 39   21.82489         osd.39         up  1.00000          1.00000
> 43   21.82489         osd.43         up  1.00000          1.00000
> 47   21.82489         osd.47         up  1.00000          1.00000
>
>
> $ ceph -s
>
>     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
>      health HEALTH_OK
>      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
>             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
>      osdmap e280: 48 osds: 48 up, 48 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
>             79218 MB used, 1047 TB / 1047 TB avail
>                 3136 active+clean
>
>
>
> On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <[email protected]> wrote:
>>
>> Thank you Dan. I’ll try it.
>>
>> Best,
>> Jialin
>> NERSC/LBNL
>>
>> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > One way you can see exactly what is happening when you write an object
>> > is with --debug_ms=1.
>> >
>> > For example, I write a 100MB object to a test pool:  rados
>> > --debug_ms=1 -p test put 100M.dat 100M.dat
>> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
>> > In this case, it first gets the cluster maps from a mon, then writes
>> > the object to osd.58, which is the primary osd for PG 119.77:
>> >
>> > # ceph pg 119.77 query | jq .up
>> > [
>> >  58,
>> >  49,
>> >  31
>> > ]
>> >
>> > Otherwise I answered your questions below...
>> >
>> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <[email protected]> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I have a couple questions regarding the IO on OSD via librados.
>> >>
>> >>
>> >> 1. How to check which osd is receiving data?
>> >>
>> >
>> > See `ceph osd map`.
>> > For my example above:
>> >
>> > # ceph osd map test 100M.dat
>> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
>> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
>> >
>> >> 2. Can the write operation return immediately to the application once the 
>> >> write to the primary OSD is done? or does it return only when the data is 
>> >> replicated twice? (size=3)
>> >
>> > Write returns once it is safe on *all* replicas or EC chunks.
>> >
>> >> 3. What is the I/O size in the lower level in librados, e.g., if I send a 
>> >> 100MB request with 1 thread, does librados send the data by a fixed 
>> >> transaction size?
>> >
>> > This depends on the client. The `rados` CLI example I showed you broke
>> > the 100MB object into 4MB parts.
>> > Most use-cases keep the objects around 4MB or 8MB.
>> >
>> >> 4. I have 4 OSS, 48 OSDs, will the 4 OSS become the bottleneck? from the 
>> >> ceph documentation, once the cluster map is received by the client, the 
>> >> client can talk to OSD directly, so the assumption is the max parallelism 
>> >> depends on the number of OSDs, is this correct?
>> >>
>> >
>> > That's more or less correct -- the IOPS and BW capacity of the cluster
>> > generally scales linearly with number of OSDs.
>> >
>> > Cheers,
>> > Dan
>> > CERN
