I use 2U servers with 9x 3.5" spinning disks in each. This has scaled well for me, in both performance and budget.
I may add 3 more spinning disks to each server at a later time if I need to maximize storage, or I may add 3 SSDs for journals/cache tier if we need better performance.

Another consideration is failure domain: if a server crashes, how much of your cluster goes down? Some good advice I've read on this list is that no single OSD server should be more than 10% of the cluster. I had taken a week off when one of my 12 OSD servers had its OS SD card fail, which took down the server. No one even noticed it went down: none of the VM clients had any performance issues, and no data was lost (3x replication). I have the recovery settings turned down as low as possible, and even so it only took about 6 hours to rebuild.

Speaking of rebuilding, do your performance measurements during a rebuild. That is when the cluster is most stressed and when performance matters most.

There's a lot to think about. Read through the archives of this mailing list; there is a lot of useful advice!

Jake

On Sat, Jan 7, 2017 at 1:38 PM Maged Mokhtar <[email protected]> wrote:

> Adding more nodes is best if you have an unlimited budget :)
> You should add more OSDs per node until you start hitting CPU or network
> bottlenecks. Use a perf tool like atop/sysstat to know when this happens.
>
> -------- Original message --------
> From: kevin parrikar <[email protected]>
> Date: 07/01/2017 19:56 (GMT+02:00)
> To: Lionel Bouton <[email protected]>
> Cc: [email protected]
> Subject: Re: [ceph-users] Analysing ceph performance with SSD journal,
> 10gbe NIC and 2 replicas - Hammer release
>
> Wow, that's a lot of good information. I wish I had known about all this
> before investing in these devices. Since I don't have any other option, I
> will get better SSDs and faster HDDs.
> I have one more generic question about Ceph: to increase the throughput
> of a cluster, what is the standard practice: more OSDs "per" node, or
> more OSD "nodes"?
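For reference, the recovery throttling Jake describes ("recovery settings turned down as low as possible") typically corresponds to ceph.conf options like these; the values below are the usual minimums, shown as an illustration rather than Jake's actual configuration:

```ini
[osd]
; Throttle recovery/backfill so client I/O keeps priority during a rebuild.
osd max backfills = 1          ; concurrent backfills per OSD (default is higher)
osd recovery max active = 1    ; concurrent recovery ops per OSD
osd recovery op priority = 1   ; deprioritize recovery vs. client ops (range 1-63)
```

Lower values stretch out recovery time (Jake's ~6 hours) in exchange for steadier client latency while the cluster rebuilds.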
> Thanks a lot for all your help. I learned so many new things; thanks again.
>
> Kevin
>
> On Sat, Jan 7, 2017 at 7:33 PM, Lionel Bouton
> <[email protected]> wrote:
>
> > Le 07/01/2017 à 14:11, kevin parrikar a écrit :
> >
> > > Thanks for your valuable input.
> > > We were using these SSDs in our NAS box (Synology), and they gave
> > > 13k IOPS for our fileserver in RAID1. We had a few spare disks which
> > > we added to our Ceph nodes, hoping they would give good performance
> > > like the NAS box. (I am not comparing NAS with Ceph, just explaining
> > > why we decided to use these SSDs.)
> > >
> > > We don't have an S3520 or S3610 at the moment but can order one of
> > > these to see how it performs in Ceph. We have 4x S3500 80GB handy.
> > > If I create a 2-node cluster with 2x S3500 each and a replica count
> > > of 2, do you think it can deliver 24MB/s of 4k writes?
> >
> > Probably not. See
> > http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >
> > According to the page above, the DC S3500 reaches 39MB/s. Its capacity
> > isn't specified; yours are 80GB only, which is the lowest capacity I'm
> > aware of, and for all DC models I know of, speed goes down with
> > capacity, so you will probably get less than that.
> >
> > If you put both data and journal on the same device you cut your
> > bandwidth in half: this gives you an average of <20MB/s per OSD (with
> > occasional peaks above that if you don't have a sustained 20MB/s).
> > With 4 OSDs and size=2, your total write bandwidth is <40MB/s. A
> > single stream of data will only get <20MB/s, though (it won't benefit
> > from parallel writes to all 4 OSDs and will only write to 2 at a time).
> >
> > Note that by comparison, the 250GB 840 EVO only reaches 1.9MB/s.
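Lionel's figures above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch of his reasoning: the 39MB/s figure comes from the journal test he links, and the halving and replication factors are the ones he states.

```python
# Back-of-the-envelope check of the write-bandwidth estimate above.
# Assumptions: 4 OSDs on DC S3500 SSDs, journal and data colocated,
# replication size=2, per-SSD sustained write ~39 MB/s (the figure
# from the sebastien-han journal test linked above).

ssd_write_mbps = 39       # raw sustained write per SSD (MB/s)
num_osds = 4
replication = 2           # size=2

# Colocating journal + data writes every byte twice on the device,
# halving the usable bandwidth per OSD.
per_osd_mbps = ssd_write_mbps / 2                   # ~19.5 MB/s per OSD

# Each client byte is written `replication` times across the cluster.
cluster_client_mbps = per_osd_mbps * num_osds / replication   # ~39 MB/s

# A single stream writes to one placement group's acting set (2 OSDs
# here, primary + replica), so it is bounded by one OSD's bandwidth.
single_stream_mbps = per_osd_mbps                   # ~19.5 MB/s

print(per_osd_mbps, cluster_client_mbps, single_stream_mbps)
```

The results match Lionel's "<20MB/s per OSD", "<40MB/s total", and "<20MB/s for a single stream".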
> > But even if you reach the 40MB/s, these models are not designed for
> > heavy writes; you will probably kill them long before their warranty
> > expires (IIRC these are rated for ~24GB of writes per day over the
> > warranty period). In your configuration you only have to write 24GB to
> > the cluster each day (as you have 4 of them, write both data and
> > journal, and use size=2) to be in this situation. That is an average
> > of only 0.28MB/s, compared to your 24MB/s target.
> >
> > > We bought the S3500 because last time, when we tried Ceph, people
> > > were suggesting this model :) :)
> >
> > The 3500 series might be enough at the higher capacities in some rare
> > cases, but the 80GB model is almost useless.
> >
> > You have to do the math considering:
> > - how much you will write to the cluster (guess high if you have to
> >   guess),
> > - whether you will use the SSDs for both journals and data (which
> >   means writing twice to them),
> > - your replication level (which means writing the same data multiple
> >   times),
> > - when you expect to replace the hardware,
> > - the amount of writes per day they support under warranty (if the
> >   manufacturer doesn't present this number prominently, they are
> >   probably trying to sell you a fast car headed for a brick wall).
> >
> > If your hardware can't handle the amount of writes you expect to put
> > into it, you are screwed. There have been reports of new Ceph users,
> > unaware of this, using cheap SSDs that all failed at the same time
> > after a matter of months. You definitely don't want to be in their
> > position.
> >
> > In fact, as problems happen (a hardware failure leading to cluster
> > rebalancing, for example), you should probably get a system able to
> > handle 10x the writes you expect it to handle, then monitor the SSDs'
> > SMART attributes to be alerted long before they die, and replace them
> > before problems happen. You definitely want a controller that allows
> > access to this information.
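The endurance arithmetic in Lionel's paragraph above checks out the same way. Note the ~24GB/day warranty rating is his recollection ("IIRC"), not a datasheet value:

```python
# Endurance arithmetic from the paragraph above: how much client
# traffic keeps four S3500s exactly at a rated ~24 GB/day each?
rated_gb_per_day_per_ssd = 24   # Lionel's recollection of the rating
num_ssds = 4
journal_amplification = 2       # journal + data on the same device
replication = 2                 # size=2

raw_budget_gb_per_day = rated_gb_per_day_per_ssd * num_ssds       # 96 GB/day
client_gb_per_day = raw_budget_gb_per_day / journal_amplification / replication
# -> 24 GB/day of client writes uses up the whole endurance budget.

avg_mbps = client_gb_per_day * 1024 / 86400   # GB -> MB, over one day of seconds
print(client_gb_per_day, round(avg_mbps, 2))  # ~0.28 MB/s average
```

So sustaining even a fraction of the 24MB/s target would burn through the rated endurance many times over, which is Lionel's point.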
> > If you can't, you will have to monitor the writes and guess this
> > value, which is risky, as write amplification inside SSDs is not easy
> > to guess...
> >
> > Lionel
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
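One way to act on Lionel's advice to watch SMART attributes is to project remaining life from the wear indicator. A minimal sketch: the attribute involved (e.g. Media_Wearout_Indicator on Intel DC drives, counting down from 100) varies by vendor, and the sample readings below are made-up values for illustration.

```python
# Project SSD remaining life from a SMART wear reading.
# Hypothetical inputs: a normalized wear indicator that starts at 100
# and counts down to 0 as rated endurance is consumed.

def days_until_worn_out(wearout_indicator, days_in_service):
    """Linear projection of days left before the wear indicator hits 0."""
    used = 100 - wearout_indicator          # percent of rated life consumed
    if used <= 0:
        return float("inf")                 # no measurable wear yet
    rate_per_day = used / days_in_service   # wear consumed per day so far
    return wearout_indicator / rate_per_day

# Example: indicator at 94 after 90 days in service -> 6% used in 90 days,
# so roughly 1410 days remain at the current write rate.
print(round(days_until_worn_out(94, 90)))
```

Feeding an alert threshold (say, fewer than 180 projected days) into your monitoring gives you the early warning Lionel recommends, without having to guess the drive-internal write amplification.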
