Hello,

On Fri, 27 Jun 2014 11:00:58 -0700 Erich Weiler wrote:

> Hi Folks,
> 
> We're going to spin up a ceph cluster with the following general specs:
> 
> * Six 10Gb/s connected servers, each with 45 4TB disks in a JBOD
> 
Interesting number of disks, what case/server is this?

> * Each disk is an OSD, so 45 OSDs per server
> 
> * So 45*6 = 270 OSDs total
>
As Udo pointed out, that's quite a large number of OSDs, which will need
to be backed by adequate CPU and RAM resources (as much of the latter as
you can afford, for read caching).

And like Udo I'd recommend running some kind of monitoring system that
can set the noout flag on the cluster if a node goes down unexpectedly; in
nearly all cases that will be the better course of action than letting
Ceph redistribute dozens of TB, only to have them redistributed again once
the node recovers.
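
For reference, the flag can be set and cleared by hand like so (your
monitoring system would just run the same commands from its node-down and
node-up handlers):

  # keep Ceph from marking the node's OSDs out and rebalancing
  ceph osd set noout
  # once the node and its OSDs are back up
  ceph osd unset noout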

> * Three separate, dedicated monitor nodes
> 
Nothing wrong with that; monitors don't need that much in terms of
resources. A FAST CPU with just a few cores and storage on SSDs for the
leveldb will do the trick.
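
In case it helps, the thing to put on those SSDs is the mon data
directory; a minimal ceph.conf sketch (the path shown is the default,
double-check it against your version):

  [mon]
      mon data = /var/lib/ceph/mon/$cluster-$id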

> The files stored on this storage cluster will be large file, each file 
> will be several GB in size at the minimum, with some files being over
> 100GB.
> 
Now this is where it gets interesting. 
How are you storing/accessing these files? 
CephFS? 
RBD volumes mounted into hosts (kernelspace)? 
RBD volumes for VMs (userspace)?

What kind of access pattern are you expecting? 
Just looking at these sizes, something like a virtual tape library
for backups comes to mind, followed by scientific data (given your
domain). ^o^

If you're expecting mostly large reads/writes that are also more or less
sequential, you're not going to be starved for IOPS; the fact that you're
not planning on using SSD-backed journals hints that way, too.

In case my assumption up there is correct, consider this alternative:

Put 2 good (I like Areca 1882, but whatever cranks your tractor) RAID
controllers with lots of HW cache in there.
Create 4x 10-disk RAID6, thus 4 OSDs.
Use the rest of the drive slots for hot spares and 2-4 for SSDs to hold
the journals. Use FAST Intel DC SSDs, preferably on another controller if
possible. You want the combined SSDs to be fast enough to handle your
network bandwidth, so that would be 2x DC S3700 400GB to come close to the
1GB/s you'd get from one 10Gb/s link.
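
To put rough numbers on that: a 400GB DC S3700 is good for something like
460MB/s of sequential writes, so two of them land around 920MB/s, close
enough to what one 10Gb/s link can actually deliver. Pointing a journal at
an SSD is then just a ceph.conf entry per OSD, along these lines (the
partition label is made up for illustration):

  [osd.0]
      # journal on a dedicated SSD partition; the partition size
      # determines the journal size
      osd journal = /dev/disk/by-partlabel/journal-0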

Advantages:

The biggest one would be a denser cluster: you can safely use x2
replication with Ceph here instead of x3. And that (a third fewer servers)
pays _easily_ for the controllers and SSDs, with plenty of savings left.

No failed OSDs (due to failed disks, at least). This will make your
administration a lot easier: just replace failed disks, no need to remove
and re-add OSDs, no potentially having to reboot a storage node to get a
specific SCSI ID back, etc.

Currently (due to lock contention and whatever other inefficiencies in the
Ceph code) the maximum IOPS I was able to achieve with a very similar setup
was about 800 write IOPS per OSD. The backing hardware was not being
stressed at all and hopefully in the future Ceph will be improved in that
regard.
See the "Slow IOPS on RBD compared to journal and backing devices" thread.
So with this setup you'll be limited to about 3200 write IOPS per node for
the time being.
However that would already be faster than your 45 plain OSDs:
45 HDDs * 100 IOPS / 2 (journal on the same disk) = 2250

Since there are only 4 OSDs per node now, you can get away with a vastly
less powerful machine when it comes to CPU resources. However, as the
inefficiencies mentioned in the previous paragraph get ironed out, it is
conceivable that things will become CPU bound, so I'd put in something
that's good for 8 to 16 OSDs.

Disadvantages:
Not many I can think of. ^o^
Maybe giving up some per-node read capacity (45 independent spindles vs. 4
RAID sets), but then again with 45 drives you're saturating your network
link long before that becomes an issue.

> Generically, are there any tuning parameters out there that would be 
> good to drop in for this hardware profile and file size?
> 
You might want to increase the default RBD object size from 4MB to
something a lot bigger, but I have no experience with that. 
Something to test and try out for yourself.
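
For what it's worth, the object size is set per image at creation time via
the order (log2 of the object size in bytes, 22 being the 4MB default), and
there's a config default for it as well. A rough sketch; the pool/image
names and the value 25 (32MB objects) are just for illustration:

  # per image
  rbd create mypool/myimage --size 102400 --order 25

  # or as a default in ceph.conf
  [client]
      rbd default order = 25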

Regards,

Christian
> We plan on growing this filesystem as we go, to 10 servers, then 15, 
> then 20, etc.
> 
> Thanks a bunch for any hints!!
> 
> cheers,
> erich


-- 
Christian Balzer        Network/Systems Engineer                
[email protected]           Global OnLine Japan/Fusion Communications
http://www.gol.com/
