On 08/11/2014 08:26 PM, Craig Lewis wrote:
> Your MON nodes are separate hardware from the OSD nodes, right?
Two nodes are OSD + MON, plus a separate MON node.
> If so, with replication=2, you should be able to shut down one of the
> two OSD nodes, and everything will continue working.
IIUC, the third MON node is sufficient for a quorum if one of the OSD +
MON nodes shuts down, is that right?
Replication=2 is a little worrisome, since we've already seen two disks
fail simultaneously just in the year the cluster has been running. That
statistically unlikely event is probably the first and last time I'll
see it, but they say lightning can strike twice....
> Since it's for experimentation, I wouldn't deal with the extra hassle
> of replication=4 and custom CRUSH rules to make it work. If you have
> your heart set on that, it should be possible. I'm no CRUSH expert
> though, so I can't say for certain until I've actually done it.
> I'm a bit confused why your performance is horrible though. I'm
> assuming your HDDs are 7200 RPM. With the SSD journals and
> replication=3, you won't have a ton of IO, but you shouldn't have any
> problem doing > 100 MB/s with 4 MB blocks. Unless your SSDs are very
> low quality, the HDDs should be your bottleneck.
The below setup is tomorrow's plan; today's reality is 3 OSDs on one
node and 2 OSDs on another, crappy SSDs, 1Gb networks, pgs stuck unclean
and no monitoring to pinpoint bottlenecks. My work is cut out for me. :)
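As a starting point on the stuck PGs, I'll be working through the
standard ceph CLI diagnostics, something like the following (exact
output varies by release; <pgid> is a placeholder for a PG id reported
by the health commands):

    ceph -s                       # overall cluster status summary
    ceph health detail            # lists the specific PGs that are stuck
    ceph pg dump_stuck unclean    # stuck PGs with their acting OSD sets
    ceph pg <pgid> query          # per-PG detail on why it is unclean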
Thanks for the helpful reply. I wish we could just add a third OSD node
and have these issues just go away, but it's not in the budget ATM.
John
On Fri, Aug 8, 2014 at 10:24 PM, John Morris <[email protected]> wrote:
Our experimental Ceph cluster is performing terribly (with the
operator to blame!), and while it's down to address some issues, I'm
curious to hear advice about the following ideas.
The cluster:
- two disk nodes (6 * CPU, 16GB RAM each)
- 8 OSDs (4 each)
- 3 monitors
- 10Gb front + back networks
- 2TB Enterprise SATA drives
- HP RAID controller w/battery-backed cache
- one SSD journal drive for each two OSDs
First, I'd like to play with taking one machine down, but with the
other node continuing to serve the cluster. To maintain redundancy
in this scenario, I'm thinking of setting the pool size to 4 and the
min_size to 2, with the idea that a proper CRUSH map should always
keep two copies on each disk node. Again, *this is for
experimentation* and probably raises red flags for production, but
I'm just asking if it's *possible*: Could one node go down and the
other node continue to serve r/w data? Any anecdotes of performance
differences between size=4 and size=3 in other clusters?
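For concreteness, something like the following is what I have in mind;
this is an untested sketch using standard CRUSH rule syntax, and the
ruleset number and <pool> name are placeholders:

    rule two_copies_per_host {
            ruleset 1
            type replicated
            min_size 2
            max_size 4
            step take default
            step choose firstn 2 type host
            step chooseleaf firstn 2 type osd
            step emit
    }

    ceph osd pool set <pool> size 4
    ceph osd pool set <pool> min_size 2
    ceph osd pool set <pool> crush_ruleset 1

The idea being: pick both hosts, then pick two OSDs under each, so a
whole node can go down and min_size=2 is still satisfied.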
Second, does it make any sense to divide the CRUSH map into an extra
level for the SSD disks, which each hold journals for two OSDs?
This might increase redundancy in case of a journal disk failure,
but ISTR something about too few OSDs in a bucket causing problems
with the CRUSH algorithm.
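Concretely, what I'm imagining is an extra bucket type between osd and
host, one bucket per journal SSD; a rough sketch (bucket names and ids
are made up):

    # in the crushmap types section, add an intermediate level:
    type 0 osd
    type 1 journalgroup   # hypothetical: one bucket per journal SSD
    type 2 host
    type 3 root

    journalgroup node1-ssd0 {
            id -10
            alg straw
            hash 0
            item osd.0 weight 1.000
            item osd.1 weight 1.000
    }

A rule could then "step chooseleaf firstn 0 type journalgroup" so that
no two replicas land on OSDs sharing a journal SSD.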
Thanks-
John
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com