Hi,
Christian Balzer wrote:
> For starters, make that 5 MONs.
> It won't really help you with your problem of keeping a quorum when
> losing a DC, but being able to lose more than 1 monitor will come in
> handy.
> Note that MONs don't really need to be dedicated nodes, if you know what
> you're doing and have enough resources (most importantly fast I/O aka SSD
> for the leveldb) on another machine.
Ok, I'll keep that in mind.
>> In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
>> Journals in SSD, there are 2 SSD so 3 journals per SSD.
>> In DC2: the same config.
>>
> Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
> (relatively low) density per node?
Sorry, I have no idea because it was just an example to make things
concrete. I took an (imaginary) server with 8 disks and 2 SSDs (among
the 8 disks, 2 for the OS in software RAID1). Currently, I can't be
precise about hardware because we are absolutely not fixed on the
budget (if we get it!); there are lots of uncertainties.
> 4 nodes make a pretty small cluster, if you lose an SSD or a whole node
> your cluster will get rather busy and may run out of space if you filled
> it more than 50%.
Yes indeed, that's a relevant remark. If the cluster is ~50% filled and a
node crashes in a DC, the other node in the same DC will be 100% filled and
the cluster will be blocked. Indeed, the cluster is probably too small.
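To make that concrete, here is a small back-of-the-envelope sketch (all
numbers are assumptions matching the 2-nodes-per-DC example above) of what
happens to the surviving node's fill level when its peer crashes:

```python
# Hypothetical example: 2 OSD nodes per DC, each with 6 OSDs of 2 TB,
# and one full copy of the data per DC (replica spread across DCs).
node_capacity_tb = 6 * 2               # 12 TB raw per node
dc_capacity_tb = 2 * node_capacity_tb  # 24 TB raw per DC

fill_ratio = 0.50                      # cluster ~50% filled
data_in_dc_tb = dc_capacity_tb * fill_ratio  # 12 TB stored per DC

# If one node in the DC dies, Ceph tries to re-replicate its data
# onto the surviving node of that DC:
surviving_fill = data_in_dc_tb / node_capacity_tb
print(f"Surviving node fill level: {surviving_fill:.0%}")
```

At 100% the OSDs hit their full ratio and the cluster stops accepting
writes, which matches the "blocked" scenario above.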
> Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy to
> "bless" you with a double disk failure. A very distinct probability with
> 24 HDDs.
The probability of a *simultaneous* disk failure in DC1 and in DC2 seems
relatively low to me. For instance, if a disk fails in DC1 and the
rebalancing of data takes ~1 or 2 hours, that seems acceptable to me. But
maybe I'm too optimistic... ;)
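For what it's worth, the risk is less a strictly simultaneous failure than
a second failure during the recovery window. A rough estimate (the annual
failure rate and recovery window below are assumptions for illustration,
not measured values):

```python
# Rough estimate; all numbers are assumptions for illustration.
n_disks = 24            # total HDDs in the cluster
afr = 0.04              # assumed annualized failure rate per HDD
hours_per_year = 24 * 365
recovery_hours = 2      # assumed rebalancing window after the first failure

# Probability that one given disk fails during the recovery window:
p_disk_in_window = afr * recovery_hours / hours_per_year
# Probability that any of the remaining disks fails during that window:
p_second_failure = 1 - (1 - p_disk_in_window) ** (n_disks - 1)
print(f"~{p_second_failure:.5%} per incident")
```

Small per incident, but it compounds over every disk failure across the
cluster's lifetime, and with replica 2 a second failure in the wrong PG
means data loss.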
> With OSDs backed by plain HDDs you really want a replica size of 3.
But the "2-DCs" topology isn't really suitable for a replica size of 3, is it?
Is a replica size of 2 really that risky?
> Normally you'd configure Ceph to NOT set OSDs out automatically if a DC
> fails (mon_osd_down_out_subtree_limit)
I didn't know about this option. In the online doc, the explanations are not
clear enough for me and I'm not sure I understand its meaning. If I set:
mon_osd_down_out_subtree_limit = datacenter
what are the consequences?
- If all OSDs in DC2 are unreachable, these OSDs will not be marked out.
- If only some OSDs in DC2 are unreachable, but not all of them, these
OSDs will be marked out.
Am I correct?
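If I understand correctly, the setting would live in the [mon] section of
ceph.conf, something like this (a sketch, assuming the CRUSH map actually
declares datacenter buckets):

```ini
[mon]
# Do not automatically mark OSDs "out" when the failed subtree is a
# whole datacenter (or larger); smaller failure domains, e.g. a single
# host or OSD, are still marked out and trigger rebalancing as usual.
mon_osd_down_out_subtree_limit = datacenter
```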
> but in the case of a prolonged DC
> outage you'll want to restore redundancy and set those OSDs out.
> Which means you will need 3 times the actual data capacity on your
> surviving 2 nodes.
> In other words, if your 24 OSDs are 2TB each you can "safely" only store
> 8TB in your cluster (48TB / 3 (replica) / 2 (DCs)).
I see, but my idea was just to handle a DC1 disaster long enough that
I must restart the cluster in degraded mode in DC2, but not long enough
that I must restore full redundancy in DC2. Personally, I hadn't
considered this case and, unfortunately, I think we will never have the
budget to restore full redundancy in just one datacenter. I'm afraid
that is an unattainable goal for us.
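Christian's capacity arithmetic above, written out explicitly:

```python
# Capacity arithmetic from the thread: 24 OSDs of 2 TB, replica 3,
# and enough headroom so one DC alone can hold all replicas.
raw_tb = 24 * 2               # 48 TB raw across both DCs
replica = 3
usable_tb = raw_tb / replica  # 16 TB usable cluster-wide
safe_tb = usable_tb / 2       # 8 TB if one DC must absorb everything
print(safe_tb)                # -> 8.0
```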
> Fiber isn't magical FTL (faster than light) communications and the latency
> depends (mostly) on the length (which you may or may not control) and the
> protocol used.
> A 2m long GbE link has a much worse latency than the same length in
> Infiniband.
In our case, if we can implement this infrastructure (if we have the
budget etc.), the connection would probably be 2 dark fibers with 10km
between DC1 and DC2. And we'll use Ethernet switches with SFP transceivers
(if you have good references for switches, I'm interested). I suppose it
could be possible to have low latencies in this case, no?
> You will of course need "enough" bandwidth, but what is going to kill
> (making it rather slow) your cluster will be the latency between those DCs.
>
> Each write will have to be acknowledged and this is where every ms less of
> latency will make a huge difference.
Yes indeed, I understand.
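As a rough sanity check on the 10 km dark fiber mentioned above: light in
glass travels at about 200 km/ms, i.e. roughly 5 µs per km (a textbook
figure, not a measurement):

```python
# Propagation delay only; switch, NIC and serialization latency come on top.
fiber_km = 10
us_per_km = 5.0                    # ~5 microseconds per km in glass
one_way_us = fiber_km * us_per_km  # ~50 us one way
rtt_us = 2 * one_way_us            # ~100 us round trip
print(f"RTT over the fiber alone: ~{rtt_us:.0f} us")
```

So the fiber itself adds on the order of 0.1 ms per round trip; the
equipment at each end usually dominates the latency budget.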
>> For instance, I suppose the OSD disks in DC1 (and in DC2) has
>> a throughput equal to 150 MB/s, so with 12 OSD disk in each DC,
>> I have:
>>
>> 12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps
>>
>> So, in the fiber, I need to have 14.4 Mbs. Is it correct?
>
> How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?
Sorry, it was a misprint, I wanted to write 14.4 Gb/s of course. ;)
> You need to multiply, not divide.
> And assuming 10 bits (not 8) for a Byte when serialized never hurts.
> So that's 18 Gb/s.
Yes, indeed. So the "naive" estimation gives 18 Gb/s (Ok for 10 bits
instead of 8).
>> Maybe is it too naive reasoning?
>
> Very much so. Your disks (even with SSD journals) will not write 150MB/s,
> because Ceph doesn't do long sequential writes (though 4MB blobs are
> better than nothing) and more importantly these writes are concurrent.
> So while one client is writing to an object at one end of your HDD another
> one may write to a very different, distant location. Seeking delays.
> With more than one client, you'd be lucky to see 50-70MB/s per HDD.
Ok, but if I follow your explanations, the throughput obtained with the
"naive" estimation is too high. In fact, I would rather have:
12 x 70 = 840 MB/s ie 0.840 GB/s => 8.4 Gb/s
Correct?
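The revised estimate in numbers (assuming Christian's 70 MB/s per HDD and
10 bits per byte on the wire, as discussed above):

```python
n_osds_per_dc = 12
mb_per_s_per_osd = 70         # assumed realistic per-HDD write throughput
bits_per_byte_on_wire = 10    # serialization overhead margin

aggregate_mb_s = n_osds_per_dc * mb_per_s_per_osd  # 840 MB/s
aggregate_gb_s = aggregate_mb_s * bits_per_byte_on_wire / 1000
print(f"~{aggregate_gb_s:.1f} Gb/s")
```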
>> Furthermore I have not taken into account the SSD. How evaluate the
>> needed throughput more precisely?
>>
> You need to consider the speed of the devices, their local bus (hopefully
> fast enough) and the network.
>
> All things considered you probably want a redundant link (but with
> bandwidth aggregation if both links are up).
> 10Gb/s per link would do, but 40Gb/s links (or your storage network on
> something other than Ethernet) will have less latency on top of the
> capacity for future expansion.
Ok, thanks for your help Christian.
--
François Lafont
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com