Hello,

On Fri, 27 Jun 2014 11:00:58 -0700 Erich Weiler wrote:
> Hi Folks,
>
> We're going to spin up a ceph cluster with the following general specs:
>
> * Six 10Gb/s connected servers, each with 45 4TB disks in a JBOD
>
Interesting amount of disks, what case/server is this?

> * Each disk is an OSD, so 45 OSDs per server
>
> * So 45*6 = 270 OSDs total
>
As Udo pointed out, that's quite a large number of OSDs, which would need
to be backed by adequate CPU and RAM resources (as much of the latter as
you can afford, for read caching).

And like Udo I'd recommend running some kind of monitoring system that can
set the cluster to noout if a node goes down unexpectedly; in nearly all
cases that will be the better course of action than letting Ceph
redistribute dozens of TB only to have them redistributed again once the
node recovers. (A minimal sketch of such a hook is further down.)

> * Three separate, dedicated monitor nodes
>
Nothing wrong with that, they don't need that much in terms of resources;
a FAST CPU with just a few cores and storage on SSDs for the leveldb will
do the trick.

> The files stored on this storage cluster will be large files, each file
> will be several GB in size at the minimum, with some files being over
> 100GB.
>
Now this is where it gets interesting.
How are you storing/accessing these files? CephFS? RBD volumes mounted
into hosts (kernelspace)? RBD volumes for VMs (userspace)?
What kind of access pattern are you expecting?
Just looking at these sizes something like a virtual tape library for
backups comes to mind, followed by scientific data (given your domain). ^o^

If you're expecting mostly large reads/writes that are also more or less
sequential, you're not going to be starved for IOPS, and the fact that
you're not planning on using SSD-backed journals hints that way, too.

In case my assumption up there is correct, consider this alternative:

Put 2 good (I like Areca 1882, but whatever cranks your tractor) RAID
controllers with lots of HW cache in there.
Create 4x 10-disk RAID6 arrays, thus 4 OSDs per node.
Use the rest of the drive slots for hot spares and 2-4 for SSDs to hold
the journals. Use FAST Intel DC SSDs, preferably on another controller if
possible. You want the combined SSDs to be fast enough to handle your
network bandwidth, so that would be 2x DC S3700 400GB to come close to
the 1GB/s you'd get from one 10Gb/s link.

Advantages:
The biggest one would be a denser cluster; you can safely use x2
replication with Ceph here instead of x3. And that (1/3rd fewer servers)
pays _easily_ for the controllers and SSDs, with plenty of savings left.

No failed OSDs (due to failed disks at least). This will make your
administration a lot easier: just replace failed disks, no need to remove
and add OSDs, no need to potentially reboot a storage node to get a
specific SCSI ID back, etc.

Currently (due to lock contention and whatever other inefficiencies in
the Ceph code) the maximum IOPS I was able to achieve with a very similar
setup was about 800 write IOPS per OSD. The backing hardware was not
being stressed at all, and hopefully Ceph will be improved in that regard
in the future. See the "Slow IOPS on RBD compared to journal and backing
devices" thread.
So with this setup you'll be limited to about 3200 write IOPS per node
for the time being. However that would already be faster than your 45
plain OSDs: 45 HDDs * 100 IOPS / 2 (journal on the same disk) = 2250.
(Rough numbers are spelled out further down.)

Since there are only 4 OSDs per node now, you can get away with a vastly
less powerful machine when it comes to CPU resources.
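Coming back to the noout recommendation above, a minimal sketch (not a
finished tool) of what such a hook could look like. It assumes the ceph
CLI with admin credentials on the monitoring host; the host names and the
crude ping check are just placeholders for whatever view your monitoring
system already has:

#!/usr/bin/env python
# Sketch: keep the noout flag set while an OSD node is unreachable, so
# Ceph doesn't start rebalancing dozens of TB during a reboot/repair.
import subprocess

OSD_NODES = ["osd-node01", "osd-node02"]  # example host names

def node_up(host):
    # Crude reachability check; replace with your monitoring system's view.
    return subprocess.call(["ping", "-c", "1", "-W", "2", host]) == 0

def osd_flag(action):
    # action is "set" or "unset"; "ceph osd set noout" stops OSDs from
    # being marked out (and thus rebalanced) while they are down.
    subprocess.check_call(["ceph", "osd", action, "noout"])

if __name__ == "__main__":
    if all(node_up(h) for h in OSD_NODES):
        osd_flag("unset")
    else:
        osd_flag("set")

In practice you'd wire this into Nagios/Icinga or whatever you already
run as an event handler, rather than cron'ing a standalone script.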
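And to make the back-of-envelope numbers above explicit (the per-device
figures are rough datasheet/rule-of-thumb assumptions, not measurements
of your hardware):

# Ballpark sanity check for the journal bandwidth and IOPS estimates above.
# All per-device numbers are rough assumptions, adjust to what you buy.

nic_bytes_s   = 10e9 / 8           # one 10Gb/s link, ~1.25 GB/s raw
ssd_write_b_s = 460e6              # ~sequential write of a DC S3700 400GB
journal_b_s   = 2 * ssd_write_b_s  # 2 journal SSDs -> ~0.92 GB/s, close to one link

raid_node_iops = 4 * 800           # 4 RAID6 OSDs * ~800 write IOPS ceiling = 3200
jbod_node_iops = 45 * 100 / 2      # 45 HDDs * ~100 IOPS, halved by on-disk journal = 2250

print(nic_bytes_s, journal_b_s, raid_node_iops, jbod_node_iops)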
However, as the inefficiencies mentioned above get ironed out, it is
conceivable for things to become CPU bound, so I'd put in something
that's good for 8 to 16 OSDs.

Disadvantages:
Not many I can think of. ^o^
Maybe giving up some per-node read capacity compared to 45 individual
OSDs, but then again with 45 drives you're saturating your network link
long before that becomes an issue.

> Generically, are there any tuning parameters out there that would be
> good to drop in for this hardware profile and file size?
>
You might want to increase the default RBD object size from 4MB to
something a lot bigger, but I have no experience with that. Something to
test, try out yourself. (See the P.S. below my signature for a rough
example.)

Regards,

Christian

> We plan on growing this filesystem as we go, to 10 servers, then 15,
> then 20, etc.
>
> Thanks a bunch for any hints!!
>
> cheers,
> erich

-- 
Christian Balzer        Network/Systems Engineer
[email protected]          Global OnLine Japan/Fusion Communications
http://www.gol.com/
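P.S.: Regarding the RBD object size: if I remember the CLI correctly, the
object size is set per image at creation time via the order parameter
(object size = 2^order bytes, the default order of 22 gives 4MB objects),
so something along these lines (image name and size are just examples,
and as said, untested by me):

rbd create --size 102400 --order 25 rbd/some-test-image

should give you an image striped over 32MB objects. Check the rbd man
page of your release before relying on it.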
