Re: [ceph-users] IO to OSD with librados

2018-06-19 Thread Hervé Ballans

On 19/06/2018 at 09:02, Dan van der Ster wrote:

The storage arrays are Nexsan E60 arrays having two active-active redundant
controllers, 60 3 TB disk drives. The disk drives are organized into six 8+2
Raid 6 LUNs of 24 TB each.


This is not the ideal Ceph hardware. Ceph is designed to use disks
directly -- JBODs. All redundancy is handled at the RADOS level, so
you can happily save lots of cash on your servers. I suggest reading
through the various Ceph hardware recommendations that you can find
via Google.


Hi,

Just on this point: RAID6 is certainly a bad configuration to use with
ceph, but to use disks "directly" there is JBOD mode, and also a single-disk
hardware RAID0 configuration (depending on the SAS controller), right?


rv




Re: [ceph-users] IO to OSD with librados

2018-06-19 Thread Jialin Liu
Thanks for the advice, Dan.

I'll try to reconfigure the cluster and see if the performance changes.

Best,
Jialin

On Tue, Jun 19, 2018 at 12:02 AM Dan van der Ster 
wrote:

> On Tue, Jun 19, 2018 at 1:04 AM Jialin Liu  wrote:
> >
> > Hi Dan, Thanks for the follow-ups.
> >
> > I have just tried running multiple librados MPI applications from
> multiple nodes, it does show increased bandwidth,
> > with ceph -w, I observed as high as 500MB/sec (previously only 160MB/sec
> ), I think I can do finer tuning by
> > coordinating more concurrent applications to get the peak. (Sorry, I
> only have one node having rados cli installed, so I can't follow your
> example to stress the server)
> >
> >> Then you can try different replication or erasure coding settings to
> >> learn their impact on performance...
> >
> >
> > Good points.
> >
> >>
> >> PPS. What are those 21.8TB devices ?
> >
> >
> > The storage arrays are Nexsan E60 arrays having two active-active
> redundant
> > controllers, 60 3 TB disk drives. The disk drives are organized into six
> 8+2
> > Raid 6 LUNs of 24 TB each.
> >
>
> This is not the ideal Ceph hardware. Ceph is designed to use disks
> directly -- JBODs. All redundancy is handled at the RADOS level, so
> you can happily save lots of cash on your servers. I suggest reading
> through the various Ceph hardware recommendations that you can find
> via Google.
>
> I can't tell from here if this is the root cause of your performance
> issue -- but you should plan future clusters to use JBODs instead of
> expensive arrays.
>
> >
> >>
> >> PPPS. Any reason you are running jewel instead of luminous or mimic?
> >
> >
> > I have to ask the cluster admin, I'm not sure about it.
> >
> > I have one more question regarding the OSD servers and OSDs, I was told
> that the IO has to go through the 4 OSD servers (hosts), before touching
> the OSDs,
> > This is confusing to me, as I learned from the ceph document
> http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
> > the librados can talk to the OSDs directly, what am I missing here?
>
> You should have one ceph-osd process per disk (or per LUN in your
> case). The clients connect to the ceph-osd processes directly.
>
> -- dan
>
>
> >
> >
> > Best,
> > Jialin
> > NERSC/LBNL
> >
> >
> >>
> >> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu  wrote:
> >> >
> >> > Hi, To make the problem clearer, here is the configuration of the
> cluster:
> >> >
> >> > The 'problem' I have is the low bandwidth no matter how I increase
> the concurrency.
> >> > I have tried using MPI to launch 322 processes, each calling librados
> to create a handle and initialize the io context, and write one 80MB object.
> >> > I only got ~160 MB/sec, with one process, I can get ~40 MB/sec, I'm
> wondering if the number of client-osd connection is limited by the number
> of hosts.
> >> >
> >> > Best,
> >> > Jialin
> >> > NERSC/LBNL
> >> >
> >> > $ceph osd tree
> >> >
> >> > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> >
> >> > -1 1047.59473 root default
> >> >
> >> > -2  261.89868 host ngfdv036
> >> >
> >> >  0   21.82489 osd.0  up  1.0  1.0
> >> >
> >> >  4   21.82489 osd.4  up  1.0  1.0
> >> >
> >> >  8   21.82489 osd.8  up  1.0  1.0
> >> >
> >> > 12   21.82489 osd.12 up  1.0  1.0
> >> >
> >> > 16   21.82489 osd.16 up  1.0  1.0
> >> >
> >> > 20   21.82489 osd.20 up  1.0  1.0
> >> >
> >> > 24   21.82489 osd.24 up  1.0  1.0
> >> >
> >> > 28   21.82489 osd.28 up  1.0  1.0
> >> >
> >> > 32   21.82489 osd.32 up  1.0  1.0
> >> >
> >> > 36   21.82489 osd.36 up  1.0  1.0
> >> >
> >> > 40   21.82489 osd.40 up  1.0  1.0
> >> >
> >> > 44   21.82489 osd.44 up  1.0  1.0
> >> >
> >> > -3  261.89868 host ngfdv037
> >> >
> >> >  1   21.82489 osd.1  up  1.0  1.0
> >> >
> >> >  5   21.82489 osd.5  up  1.0  1.0
> >> >
> >> >  9   21.82489 osd.9  up  1.0  1.0
> >> >
> >> > 13   21.82489 osd.13 up  1.0  1.0
> >> >
> >> > 17   21.82489 osd.17 up  1.0  1.0
> >> >
> >> > 21   21.82489 osd.21 up  1.0  1.0
> >> >
> >> > 25   21.82489 osd.25 up  1.0  1.0
> >> >
> >> > 29   21.82489 osd.29 up  1.0  1.0
> >> >
> >> > 33   21.82489 osd.33 up  1.0  1.0
> >> >
> >> > 37   21.82489 osd.37 up  1.0  1.0
> >> >
> >> > 41   21.82489 osd.41 up  1.0 

Re: [ceph-users] IO to OSD with librados

2018-06-18 Thread Jialin Liu
Hi Dan, Thanks for the follow-ups.

I have just tried running multiple librados MPI applications from multiple
nodes, and it does show increased bandwidth: with `ceph -w` I observed as high
as 500 MB/sec (previously only 160 MB/sec). I think I can do finer tuning by
coordinating more concurrent applications to reach the peak. (Sorry, I only
have one node with the rados CLI installed, so I can't follow your example to
stress the server.)
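
For anyone else in the same situation, roughly the same kind of load can be
generated straight from the python-rados bindings instead of the rados CLI.
A minimal sketch, assuming a pool named 'test' exists, the default ceph.conf
path is readable, and made-up thread and object counts:

# write_stress.py -- rough client-side write test via python-rados (no rados CLI needed).
import threading
import time
import rados

POOL = 'test'                      # assumption: a test pool already exists
OBJ_SIZE = 4 * 1024 * 1024         # 4 MB objects, similar to what rados bench writes
N_THREADS = 16
OBJS_PER_THREAD = 64
payload = b'\0' * OBJ_SIZE

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def writer(tid):
    ioctx = cluster.open_ioctx(POOL)               # one IO context per thread
    for i in range(OBJS_PER_THREAD):
        ioctx.write_full('stress-%d-%d' % (tid, i), payload)
    ioctx.close()

start = time.time()
threads = [threading.Thread(target=writer, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

total_mb = N_THREADS * OBJS_PER_THREAD * OBJ_SIZE / 1e6
print('wrote %.0f MB in %.1f s (%.1f MB/s)' % (total_mb, elapsed, total_mb / elapsed))

cluster.shutdown()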

Then you can try different replication or erasure coding settings to
> learn their impact on performance...
>

Good points.


> PPS. What are those 21.8TB devices ?
>

The storage arrays are Nexsan E60 arrays with two active-active redundant
controllers and 60 3 TB disk drives. The disk drives are organized into six
8+2 RAID 6 LUNs of 24 TB each.



> PPPS. Any reason you are running jewel instead of luminous or mimic?
>

I have to ask the cluster admin, I'm not sure about it.

I have one more question regarding the OSD servers and OSDs. I was told
that the IO has to go through the 4 OSD servers (hosts) before touching
the OSDs. This is confusing to me: according to the Ceph documentation
(http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds),
librados can talk to the OSDs directly. What am I missing here?


Best,
Jialin
NERSC/LBNL



> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu  wrote:
> >
> > Hi, To make the problem clearer, here is the configuration of the
> cluster:
> >
> > The 'problem' I have is the low bandwidth no matter how I increase the
> concurrency.
> > I have tried using MPI to launch 322 processes, each calling librados to
> create a handle and initialize the io context, and write one 80MB object.
> > I only got ~160 MB/sec, with one process, I can get ~40 MB/sec, I'm
> wondering if the number of client-osd connection is limited by the number
> of hosts.
> >
> > Best,
> > Jialin
> > NERSC/LBNL
> >
> > $ceph osd tree
> >
> > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >
> > -1 1047.59473 root default
> >
> > -2  261.89868 host ngfdv036
> >
> >  0   21.82489 osd.0  up  1.0  1.0
> >
> >  4   21.82489 osd.4  up  1.0  1.0
> >
> >  8   21.82489 osd.8  up  1.0  1.0
> >
> > 12   21.82489 osd.12 up  1.0  1.0
> >
> > 16   21.82489 osd.16 up  1.0  1.0
> >
> > 20   21.82489 osd.20 up  1.0  1.0
> >
> > 24   21.82489 osd.24 up  1.0  1.0
> >
> > 28   21.82489 osd.28 up  1.0  1.0
> >
> > 32   21.82489 osd.32 up  1.0  1.0
> >
> > 36   21.82489 osd.36 up  1.0  1.0
> >
> > 40   21.82489 osd.40 up  1.0  1.0
> >
> > 44   21.82489 osd.44 up  1.0  1.0
> >
> > -3  261.89868 host ngfdv037
> >
> >  1   21.82489 osd.1  up  1.0  1.0
> >
> >  5   21.82489 osd.5  up  1.0  1.0
> >
> >  9   21.82489 osd.9  up  1.0  1.0
> >
> > 13   21.82489 osd.13 up  1.0  1.0
> >
> > 17   21.82489 osd.17 up  1.0  1.0
> >
> > 21   21.82489 osd.21 up  1.0  1.0
> >
> > 25   21.82489 osd.25 up  1.0  1.0
> >
> > 29   21.82489 osd.29 up  1.0  1.0
> >
> > 33   21.82489 osd.33 up  1.0  1.0
> >
> > 37   21.82489 osd.37 up  1.0  1.0
> >
> > 41   21.82489 osd.41 up  1.0  1.0
> >
> > 45   21.82489 osd.45 up  1.0  1.0
> >
> > -4  261.89868 host ngfdv038
> >
> >  2   21.82489 osd.2  up  1.0  1.0
> >
> >  6   21.82489 osd.6  up  1.0  1.0
> >
> > 10   21.82489 osd.10 up  1.0  1.0
> >
> > 14   21.82489 osd.14 up  1.0  1.0
> >
> > 18   21.82489 osd.18 up  1.0  1.0
> >
> > 22   21.82489 osd.22 up  1.0  1.0
> >
> > 26   21.82489 osd.26 up  1.0  1.0
> >
> > 30   21.82489 osd.30 up  1.0  1.0
> >
> > 34   21.82489 osd.34 up  1.0  1.0
> >
> > 38   21.82489 osd.38 up  1.0  1.0
> >
> > 42   21.82489 osd.42 up  1.0  1.0
> >
> > 46   21.82489 osd.46 up  1.0  1.0
> >
> > -5  261.89868 host ngfdv039
> >
> >  3   21.82489 osd.3  up  1.0  1.0
> >
> >  7   21.82489 osd.7  up  1.0  

Re: [ceph-users] IO to OSD with librados

2018-06-18 Thread Dan van der Ster
Hi,

Have you tried running rados bench in parallel from several client
machines? That would demonstrate the full BW capacity of the cluster.

e.g. make a test pool with 256 PGs (which will average 16 per OSD
on your cluster).
Then from several clients at once do `rados bench -p test 60 write`,
and at the same time run `watch ceph status` to see the total bandwidth.

Then you can try different replication or erasure coding settings to
learn their impact on performance...

-- dan

P.S. two mons is never a good idea. Use 3.

PPS. What are those 21.8TB devices ?

PPPS. Any reason you are running jewel instead of luminous or mimic?

On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu  wrote:
>
> Hi, To make the problem clearer, here is the configuration of the cluster:
>
> The 'problem' I have is the low bandwidth no matter how I increase the 
> concurrency.
> I have tried using MPI to launch 322 processes, each calling librados to 
> create a handle and initialize the io context, and write one 80MB object.
> I only got ~160 MB/sec, with one process, I can get ~40 MB/sec, I'm wondering 
> if the number of client-osd connection is limited by the number of hosts.
>
> Best,
> Jialin
> NERSC/LBNL
>
> $ceph osd tree
>
> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>
> -1 1047.59473 root default
>
> -2  261.89868 host ngfdv036
>
>  0   21.82489 osd.0  up  1.0  1.0
>
>  4   21.82489 osd.4  up  1.0  1.0
>
>  8   21.82489 osd.8  up  1.0  1.0
>
> 12   21.82489 osd.12 up  1.0  1.0
>
> 16   21.82489 osd.16 up  1.0  1.0
>
> 20   21.82489 osd.20 up  1.0  1.0
>
> 24   21.82489 osd.24 up  1.0  1.0
>
> 28   21.82489 osd.28 up  1.0  1.0
>
> 32   21.82489 osd.32 up  1.0  1.0
>
> 36   21.82489 osd.36 up  1.0  1.0
>
> 40   21.82489 osd.40 up  1.0  1.0
>
> 44   21.82489 osd.44 up  1.0  1.0
>
> -3  261.89868 host ngfdv037
>
>  1   21.82489 osd.1  up  1.0  1.0
>
>  5   21.82489 osd.5  up  1.0  1.0
>
>  9   21.82489 osd.9  up  1.0  1.0
>
> 13   21.82489 osd.13 up  1.0  1.0
>
> 17   21.82489 osd.17 up  1.0  1.0
>
> 21   21.82489 osd.21 up  1.0  1.0
>
> 25   21.82489 osd.25 up  1.0  1.0
>
> 29   21.82489 osd.29 up  1.0  1.0
>
> 33   21.82489 osd.33 up  1.0  1.0
>
> 37   21.82489 osd.37 up  1.0  1.0
>
> 41   21.82489 osd.41 up  1.0  1.0
>
> 45   21.82489 osd.45 up  1.0  1.0
>
> -4  261.89868 host ngfdv038
>
>  2   21.82489 osd.2  up  1.0  1.0
>
>  6   21.82489 osd.6  up  1.0  1.0
>
> 10   21.82489 osd.10 up  1.0  1.0
>
> 14   21.82489 osd.14 up  1.0  1.0
>
> 18   21.82489 osd.18 up  1.0  1.0
>
> 22   21.82489 osd.22 up  1.0  1.0
>
> 26   21.82489 osd.26 up  1.0  1.0
>
> 30   21.82489 osd.30 up  1.0  1.0
>
> 34   21.82489 osd.34 up  1.0  1.0
>
> 38   21.82489 osd.38 up  1.0  1.0
>
> 42   21.82489 osd.42 up  1.0  1.0
>
> 46   21.82489 osd.46 up  1.0  1.0
>
> -5  261.89868 host ngfdv039
>
>  3   21.82489 osd.3  up  1.0  1.0
>
>  7   21.82489 osd.7  up  1.0  1.0
>
> 11   21.82489 osd.11 up  1.0  1.0
>
> 15   21.82489 osd.15 up  1.0  1.0
>
> 19   21.82489 osd.19 up  1.0  1.0
>
> 23   21.82489 osd.23 up  1.0  1.0
>
> 27   21.82489 osd.27 up  1.0  1.0
>
> 31   21.82489 osd.31 up  1.0  1.0
>
> 35   21.82489 osd.35 up  1.0  1.0
>
> 39   21.82489 osd.39 up  1.0  1.0
>
> 43   21.82489 osd.43 up  1.0  1.0
>
> 47   21.82489 osd.47 up  1.0  1.0
>
>
> ceph -s
>
> cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
>
>  health HEALTH_OK
>
>  monmap e1: 2 mons at 
> 

Re: [ceph-users] IO to OSD with librados

2018-06-18 Thread Jialin Liu
Hi, to make the problem clearer, here is the configuration of the
cluster:

The 'problem' I have is low bandwidth no matter how much I increase the
concurrency.
I have tried using MPI to launch 322 processes, each calling librados to
create a handle, initialize the IO context, and write one 80 MB object.
I only got ~160 MB/sec; with one process I can get ~40 MB/sec. I'm
wondering if the number of client-OSD connections is limited by the number
of hosts.
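
In python-rados terms, what each rank does boils down to roughly the following
(a minimal sketch; the mpi4py import, pool name, and conf path are assumptions
standing in for the real MPI/librados code):

import rados
from mpi4py import MPI            # assumption: mpi4py stands in for the real MPI launcher

rank = MPI.COMM_WORLD.Get_rank()

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # create the cluster handle
cluster.connect()                                       # pulls the cluster maps from a mon
ioctx = cluster.open_ioctx('mypool')                    # initialize the IO context ('mypool' is a placeholder)

data = b'\0' * (80 * 1024 * 1024)                       # one 80 MB object per rank
ioctx.write_full('obj-rank-%d' % rank, data)            # each rank writes its own object

ioctx.close()
cluster.shutdown()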

Best,
Jialin
NERSC/LBNL

$ceph osd tree

ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY

-1 1047.59473 root default

-2  261.89868 host ngfdv036

 0   21.82489 osd.0  up  1.0  1.0

 4   21.82489 osd.4  up  1.0  1.0

 8   21.82489 osd.8  up  1.0  1.0

12   21.82489 osd.12 up  1.0  1.0

16   21.82489 osd.16 up  1.0  1.0

20   21.82489 osd.20 up  1.0  1.0

24   21.82489 osd.24 up  1.0  1.0

28   21.82489 osd.28 up  1.0  1.0

32   21.82489 osd.32 up  1.0  1.0

36   21.82489 osd.36 up  1.0  1.0

40   21.82489 osd.40 up  1.0  1.0

44   21.82489 osd.44 up  1.0  1.0

-3  261.89868 host ngfdv037

 1   21.82489 osd.1  up  1.0  1.0

 5   21.82489 osd.5  up  1.0  1.0

 9   21.82489 osd.9  up  1.0  1.0

13   21.82489 osd.13 up  1.0  1.0

17   21.82489 osd.17 up  1.0  1.0

21   21.82489 osd.21 up  1.0  1.0

25   21.82489 osd.25 up  1.0  1.0

29   21.82489 osd.29 up  1.0  1.0

33   21.82489 osd.33 up  1.0  1.0

37   21.82489 osd.37 up  1.0  1.0

41   21.82489 osd.41 up  1.0  1.0

45   21.82489 osd.45 up  1.0  1.0

-4  261.89868 host ngfdv038

 2   21.82489 osd.2  up  1.0  1.0

 6   21.82489 osd.6  up  1.0  1.0

10   21.82489 osd.10 up  1.0  1.0

14   21.82489 osd.14 up  1.0  1.0

18   21.82489 osd.18 up  1.0  1.0

22   21.82489 osd.22 up  1.0  1.0

26   21.82489 osd.26 up  1.0  1.0

30   21.82489 osd.30 up  1.0  1.0

34   21.82489 osd.34 up  1.0  1.0

38   21.82489 osd.38 up  1.0  1.0

42   21.82489 osd.42 up  1.0  1.0

46   21.82489 osd.46 up  1.0  1.0

-5  261.89868 host ngfdv039

 3   21.82489 osd.3  up  1.0  1.0

 7   21.82489 osd.7  up  1.0  1.0

11   21.82489 osd.11 up  1.0  1.0

15   21.82489 osd.15 up  1.0  1.0

19   21.82489 osd.19 up  1.0  1.0

23   21.82489 osd.23 up  1.0  1.0

27   21.82489 osd.27 up  1.0  1.0

31   21.82489 osd.31 up  1.0  1.0

35   21.82489 osd.35 up  1.0  1.0

39   21.82489 osd.39 up  1.0  1.0

43   21.82489 osd.43 up  1.0  1.0

47   21.82489 osd.47 up  1.0  1.0

ceph -s

cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a

 health HEALTH_OK

 monmap e1: 2 mons at
{ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}

election epoch 4, quorum 0,1 ngfdv076,ngfdv078

 osdmap e280: 48 osds: 48 up, 48 in

flags sortbitwise,require_jewel_osds

  pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects

79218 MB used, 1047 TB / 1047 TB avail

3136 active+clean


On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu  wrote:

> Thank you Dan. I’ll try it.
>
> Best,
> Jialin
> NERSC/LBNL
>
> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster 
> wrote:
> >
> > Hi,
> >
> > One way you can see exactly what is happening when you write an object
> > is with --debug_ms=1.
> >
> > For example, I write a 100MB object to a test pool:  rados
> > --debug_ms=1 -p test put 100M.dat 100M.dat
> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> > In this case, it first gets the cluster maps from a mon, then writes
> > the object to 

Re: [ceph-users] IO to OSD with librados

2018-06-18 Thread Jialin Liu
Thank you Dan. I’ll try it.

Best,
Jialin
NERSC/LBNL

> On Jun 18, 2018, at 12:22 AM, Dan van der Ster  wrote:
> 
> Hi,
> 
> One way you can see exactly what is happening when you write an object
> is with --debug_ms=1.
> 
> For example, I write a 100MB object to a test pool:  rados
> --debug_ms=1 -p test put 100M.dat 100M.dat
> I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> In this case, it first gets the cluster maps from a mon, then writes
> the object to osd.58, which is the primary osd for PG 119.77:
> 
> # ceph pg 119.77 query | jq .up
> [
>  58,
>  49,
>  31
> ]
> 
> Otherwise I answered your questions below...
> 
>> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu  wrote:
>> 
>> Hello,
>> 
>> I have a couple questions regarding the IO on OSD via librados.
>> 
>> 
>> 1. How to check which osd is receiving data?
>> 
> 
> See `ceph osd map`.
> For my example above:
> 
> # ceph osd map test 100M.dat
> osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> 
>> 2. Can the write operation return immediately to the application once the 
>> write to the primary OSD is done? or does it return only when the data is 
>> replicated twice? (size=3)
> 
> Write returns once it is safe on *all* replicas or EC chunks.
> 
>> 3. What is the I/O size in the lower level in librados, e.g., if I send a 
>> 100MB request with 1 thread, does librados send the data by a fixed 
>> transaction size?
> 
> This depends on the client. The `rados` CLI example I showed you broke
> the 100MB object into 4MB parts.
> Most use-cases keep the objects around 4MB or 8MB.
> 
>> 4. I have 4 OSS, 48 OSDs, will the 4 OSS become the bottleneck? from the 
>> ceph documentation, once the cluster map is received by the client, the 
>> client can talk to OSD directly, so the assumption is the max parallelism 
>> depends on the number of OSDs, is this correct?
>> 
> 
> That's more or less correct -- the IOPS and BW capacity of the cluster
> generally scales linearly with number of OSDs.
> 
> Cheers,
> Dan
> CERN


Re: [ceph-users] IO to OSD with librados

2018-06-18 Thread Dan van der Ster
Hi,

One way you can see exactly what is happening when you write an object
is with --debug_ms=1.

For example, I write a 100MB object to a test pool:  rados
--debug_ms=1 -p test put 100M.dat 100M.dat
I pasted the output of this here: https://pastebin.com/Zg8rjaTV
In this case, it first gets the cluster maps from a mon, then writes
the object to osd.58, which is the primary osd for PG 119.77:

# ceph pg 119.77 query | jq .up
[
  58,
  49,
  31
]

Otherwise I answered your questions below...

On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu  wrote:
>
> Hello,
>
> I have a couple questions regarding the IO on OSD via librados.
>
>
> 1. How to check which osd is receiving data?
>

See `ceph osd map`.
For my example above:

# ceph osd map test 100M.dat
osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
(119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)

> 2. Can the write operation return immediately to the application once the 
> write to the primary OSD is done? or does it return only when the data is 
> replicated twice? (size=3)

Write returns once it is safe on *all* replicas or EC chunks.
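
You can see the same guarantee from the async API: the 'safe' callback of an
aio write fires only once every replica (or EC chunk) has the data durably,
and a plain synchronous write blocks until that same point. A rough
python-rados sketch, with pool and object names made up:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('test')            # placeholder pool name

def on_complete(completion):
    print('write acknowledged')               # fired when the OSDs acknowledge the write

def on_safe(completion):
    print('write safe')                       # fired once the data is durable on all replicas / EC chunks

comp = ioctx.aio_write_full('myobj', b'some payload',
                            oncomplete=on_complete, onsafe=on_safe)
comp.wait_for_safe()                          # a plain ioctx.write_full() would block until here
ioctx.close()
cluster.shutdown()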

> 3. What is the I/O size in the lower level in librados, e.g., if I send a 
> 100MB request with 1 thread, does librados send the data by a fixed 
> transaction size?

This depends on the client. The `rados` CLI example I showed you broke
the 100MB object into 4MB parts.
Most use-cases keep the objects around 4MB or 8MB.
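
Mimicking that chunking from librados yourself would look roughly like this
(a sketch; the 4 MB chunk size, pool name, and file name are assumptions):

import rados

CHUNK = 4 * 1024 * 1024                       # 4 MB per write op, like the rados CLI default
data = open('100M.dat', 'rb').read()          # local file to upload

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('test')            # placeholder pool name

# write the buffer into one object as a series of fixed-size ops at increasing offsets
for off in range(0, len(data), CHUNK):
    ioctx.write('100M.dat', data[off:off + CHUNK], off)

ioctx.close()
cluster.shutdown()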

> 4. I have 4 OSS, 48 OSDs, will the 4 OSS become the bottleneck? from the ceph 
> documentation, once the cluster map is received by the client, the client can 
> talk to OSD directly, so the assumption is the max parallelism depends on the 
> number of OSDs, is this correct?
>

That's more or less correct -- the IOPS and BW capacity of the cluster
generally scales linearly with number of OSDs.

Cheers,
Dan
CERN


Re: [ceph-users] IO to OSD with librados

2018-06-18 Thread Jialin Liu
Sorry about the misused term 'OSS: object storage server' (a term often
used in the Lustre filesystem); what I meant is 4 hosts, each managing 12 OSDs.
Thanks to anyone who can answer any of my questions.

Best,
Jialin
NERSC/LBNL

On Sun, Jun 17, 2018 at 11:29 AM Jialin Liu  wrote:

> Hello,
>
> I have a couple questions regarding the IO on OSD via librados.
>
>
> 1. How to check which osd is receiving data?
>
> 2. Can the write operation return immediately to the application once the
> write to the primary OSD is done? or does it return only when the data is
> replicated twice? (size=3)
>
> 3. What is the I/O size in the lower level in librados, e.g., if I send a
> 100MB request with 1 thread, does librados send the data by a fixed
> transaction size?
>
> 4. I have 4 OSS, 48 OSDs, will the 4 OSS become the bottleneck? from the
> ceph documentation, once the cluster map is received by the client, the
> client can talk to OSD directly, so the assumption is the max parallelism
> depends on the number of OSDs, is this correct?
>
>
> Best,
>
> Jialin
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com