[ceph-users] Did maximum performance reached?

Shneur Zalman Mattern Tue, 28 Jul 2015 02:42:08 -0700

Hi, Johannes (that's my grandpa's name)

The size is 2. Do you really think that the number of replicas can affect
performance?
The page http://ceph.com/docs/master/architecture/ says:
"Note: Striping is independent of object replicas. Since CRUSH
replicates objects across OSDs, stripes get replicated automatically."


OK, I'll check it,
Regards, Shneur
________________________________________
From: Johannes Formann <[email protected]>
Sent: Tuesday, July 28, 2015 12:09 PM
To: Shneur Zalman Mattern
Cc: [email protected]
Subject: Re: [ceph-users] Did maximum performance reached?

Hello,

What is the "size" parameter of your pool?

Some math shows the impact:
size=3 means each write is written 6 times (3 copies, each first to the journal
and later to the disk). Calculating with 1,300 MB/s of "client" bandwidth, that means:

3 (size) * 1300 MB/s / 6 (SSD) => 650 MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) => 130 MB/s per HDD

If you use size=3, the results are as good as one can expect. (Even with size=2
the results won't be bad.)
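
The same arithmetic as a rough Python sketch, in case you want to try other values. The function name and constants are only illustrative, and it assumes that writes spread evenly over all devices and that every copy goes to a journal SSD first and to a data HDD later:

```python
def per_device_write_mb_s(client_mb_s: float, size: int, n_devices: int) -> float:
    """Average write load on each device of one tier (journal SSDs or data HDDs).

    Every client byte is stored `size` times, and each copy hits a journal
    SSD first and a data HDD later, so each tier absorbs size * client_mb_s.
    """
    return size * client_mb_s / n_devices


CLIENT_MB_S = 1300  # approximate aggregate client bandwidth from the fio runs
N_SSD = 6           # 3 OSD nodes x 2 journal SSDs
N_HDD = 30          # 3 OSD nodes x 10 OSDs

for size in (2, 3):
    ssd = per_device_write_mb_s(CLIENT_MB_S, size, N_SSD)
    hdd = per_device_write_mb_s(CLIENT_MB_S, size, N_HDD)
    print(f"size={size}: {ssd:.0f} MB/s per SSD, {hdd:.0f} MB/s per HDD")
```

With size=2 this gives roughly 433 MB/s per SSD and 87 MB/s per HDD, which is still close to what the devices can sustain.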

greetings

Johannes

> On 28.07.2015 at 10:53, Shneur Zalman Mattern <[email protected]> wrote:
>
> We've built a Ceph cluster:
>     3 mon nodes (one of them is combined with mds)
>     3 osd nodes (each one has 10 osds + 2 ssds for journaling)
>     switch 24 ports x 10G
>     10 gigabit - for public network
>     20 gigabit bonding - between osds
>     Ubuntu 12.04.05
>     Ceph 0.87.2
> -----------------------------------------------------
> Clients have:
>     10 gigabit for ceph-connection
>     CentOS 6.6 with kernel 3.19.8 and the cephfs kernel module
>
>
>
> ====== fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===========
> Single client:
> ++++++++++++++++++++++++++++
>
> Starting 16 processes
>
> .....below is just 1 job info....
> trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
> 13:26:24 2015
>   write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
>     slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
>     clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
>      lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
>     clat percentiles (usec):
>      |  1.00th=[    1],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
>      | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
>      | 99.00th=[    9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
>      | 99.99th=[   62]
>     bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, 
> stdev=26397.76
>     lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
>     lat (usec) : 100=0.03%
>   cpu          : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=10240/d=0, short=r=0/w=0/d=0
>
> ...what's above repeats 16 times...
>
> Run status group 0 (all jobs):
>   WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
> mint=133312msec, maxt=134329msec
>
> +++++++++++++++++++++++++++++++++
> Two clients:
> +++++++++++++++++++++++++++++++++
> ....below is just 1 job info....
> trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 
> 14:05:59 2015
>   write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
>     slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
>     clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
>      lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
>     clat percentiles (usec):
>      |  1.00th=[    2],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
>      | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
>      | 99.00th=[    8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
>      | 99.99th=[   56]
>     bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, 
> stdev=21905.92
>     lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
>     lat (usec) : 100=0.03%
>   cpu          : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=10240/d=0, short=r=0/w=0/d=0
>
> ...what's above repeats 16 times...
>
> Run status group 0 (all jobs):
>   WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, 
> mint=242331msec, maxt=243869msec
>
> --------- And almost the same(?!) aggregated result from the second client: ---------
>
> Run status group 0 (all jobs):
>   WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, 
> mint=244697msec, maxt=246941msec
>
> ----------------- If I sum them up: ---------------------
> aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367361KB/s (roughly 1.3GB/s)
>
> It looks like the same bandwidth as from just one client (aggrb=1219.8MB/s),
> only divided between the two. Why?
> Question: if I connect 12 client nodes, will each one be able to write at only
> about 100MB/s?
> Perhaps I need to scale out our Ceph cluster to 15 (how many?) OSD nodes so that
> it can serve 2 clients at 1.3GB/s (the bandwidth of a 10gig NIC), or not?
>
> ============================================================================
>
> health HEALTH_OK
>      monmap e1: 3 mons at 
> {mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0},
>  election epoch 140, quorum 0,1,2 mon1,mon2,mon3
>      mdsmap e12: 1/1/1 up {0=mon3=up:active}
>      osdmap e832: 31 osds: 30 up, 30 in
>       pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
>             4624 GB used, 104 TB / 109 TB avail
>                 6144 active+clean
>
>
> Perhaps I don't understand something in the Ceph architecture? I thought that:
>
> Each spindle disk can write ~100MB/s, and we have 10 SAS disks on each node,
> so the aggregated write speed is ~900MB/s per node (because of striping etc.).
> And we have 3 OSD nodes, and objects are also striped across 30 osds, so I
> thought that would aggregate as well and we'd get something around 2.5GB/s, but no...
>
> ************************************************************************************
> This footnote confirms that this email message has been scanned by
> PineApp Mail-SeCure for the presence of malicious code, vandals & computer 
> viruses.
> ************************************************************************************
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



