I'm still a noob too, so don't take anything I say with much weight. I
was hoping that somebody with more experience would reply.
I see a few potential problems.
With that CPU to disk ratio, you're going to need to slow recovery down
a lot to make sure you have enough CPU available after a node reboots.
You may need to tune it down even further in the event that a node
fails. I haven't tested a CPU starvation situation, but I suspect that
bad things would happen. You might get stuck with OSDs not responding
fast enough, so they get marked down, which triggers a recovery, which
uses more CPU, etc. I'm not even sure how you'd get out of that
situation if it started.
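For reference, this is the sort of throttling I mean. These are the usual knobs turned all the way down; the exact values are just a starting point I'd test, not something I've validated on hardware like yours:

    # ceph.conf, [osd] section
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

    # or on a running cluster (from memory, double-check the syntax):
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'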
Regarding I/O, the fact that your writes are sequential won't matter. With
the journals on the HDDs, every write to an OSD is really two writes
(journal plus data), so all I/O on the disks becomes random I/O. You have
a lot of spindles, though. Doing some quick estimates in my head, I figure
you realistically have about 200 MBps of I/O per node. That sounds pretty
low compared to the combined sequential write speed of 3.6 GBps, but
figure roughly 10 MBps of random I/O per disk, divided by the 2 writes,
which leaves about 5 MBps of client data per disk. Add to that the latency
of sending the data over the network to 1 or 2 other disks that have the
same constraints.
With replication = 2, that's 100 MBps per node. That ends up being
(best case) about 800 Mbps of RadosGW writes. Hotspotting and uneven
distribution on the nodes will lower that number. If 1 Gbps writes per
node are a hard requirement, I think you're going to be disappointed.
If your application requirements are lower, then you should be ok.
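Spelling out the back-of-envelope math, assuming roughly 10 MBps of random I/O per 4 TB spindle (my guess, measure yours):

    36 disks * 10 MBps random I/O       = ~360 MBps raw per node
    / 2 (journal write + data write)    = ~180 MBps, call it 200 MBps
    / 2 (replication = 2)               = ~100 MBps of client writes per node
    100 MBps * 8 bits                   = ~800 Mbps of RadosGW writes per node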
Regarding latency, it's hard to get specific. Just remember that your
data is being striped across many disks, so the latency of one RadosGW
operation is somewhere between the maximum latency of the OSDs involved
and the sum of their latencies. Like I said, hard to be specific. To
begin with, latency will just increase as the load increases. But at a
certain point, problems will start: OSDs will block because another OSD
won't write its data, your RadosGW load balancer might mark RadosGW nodes
down because they're unresponsive, and OSDs might kick other OSDs out
because they're too slow. Most of my Ceph headaches involve too much
latency.
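You can at least watch for the early signs of that while you test. Something like (from memory, check the exact commands and output on your version):

    ceph health detail     # look for slow/blocked request warnings
    ceph -w                # watch for OSDs flapping up/down during recovery
    ceph osd perf          # per-OSD commit/apply latency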
Overall, I think you'll be ok, unless you absolutely have to have that 1
Gbps write speed per node. Even so, you'll need to prove it. You
really want to load up the cluster with a real amount of data, then
simulate a failure and recovery under normal load. Shut a node down for
a day, then bring it back up. Stuff like that. A real production
failure will stress things differently than `ceph bench` does. I made
the mistake of testing without enough data. Things worked great when
the cluster was 5% used, but had problems when the cluster was 60% used.
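As a rough sketch of the kind of test I mean (the pool name is a placeholder, adjust to your setup):

    # fill the cluster to a realistic level first, e.g. with rados bench
    rados bench -p <testpool> 600 write --no-cleanup

    # for a planned short outage you'd normally 'ceph osd set noout' first,
    # but to test a real failure, leave it unset, stop the ceph-osd daemons
    # on one host, and let the cluster mark them out and start backfilling.
    # Watch client latency while it recovers, and again when the node returns.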
On 5/9/14 04:18, Cédric Lemarchand wrote:
Another thought: I would hope that with EC, the spreading of data chunks
would take advantage of the write capability of each drive where they are stored.
I have not gotten any reply so far! Does this kind of configuration (hardware
& software) look crazy?! Am I missing something?
Looking forward to your comments, thanks in advance.
--
Cédric Lemarchand
On 7 May 2014 at 22:10, Cedric Lemarchand <[email protected]> wrote:
Some more details: the I/O pattern will be around 90% write / 10% read,
mainly sequential.
Recent posts show that the max_backfills, recovery_max_active and
recovery_op_priority settings will be helpful in case of
backfilling/rebalancing.
Any thoughts on such a hardware setup?
On 07/05/2014 11:43, Cedric Lemarchand wrote:
Hello,
This build is only intended for archiving purposes; what matters here
is lowering the $/TB/W ratio.
Access to the storage would be via radosgw, installed on each node.
I need each node to sustain an average write rate of 1 Gbps, which I
think will not be a problem. Erasure coding will be used with something
like k=12, m=3.
A typical node would be:
- Supermicro 36-bay chassis
- 2x Xeon E5-2630Lv2
- 96 GB RAM (the recommended 1 GB/TB ratio for OSDs is lowered a bit ... )
- LSI HBA adapters in JBOD mode, could be 2x 9207-8i
- 36x 4 TB HDDs with the default journal config
- dedicated bonded 2 Gbps links for the public/private networks
(backfilling will take ages if a full node is lost ...)
I think that in an *optimal* state (Ceph healthy), it could handle the
job. Waiting for your comments.
What bothers me more are OSD maintenance operations like backfilling
and cluster rebalancing, where nodes will be put under very high I/O,
memory and CPU load for hours/days. Will latency *just* grow, or will
everything fly apart? (OOM killer kicking in, OSDs committing suicide
because of latency, nodes pushed out of the cluster, etc. ...)
As you can see, I am trying to design the cluster with a sweet spot in
mind like "things become slow, latency grows, but the nodes stay
stable/usable and aren't pushed out of the cluster".
This is my first jump into Ceph, so any input will be greatly
appreciated ;-)
Cheers,
--
Cédric
--
Cédric
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com