Hi,
We are thinking about a Ceph infrastructure and I have some questions.
Here is the conceived (but not yet implemented) infrastructure:
(please, be careful to read the schema with a monospace font ;))
+---------+
| users |
|(browser)|
+----+----+
|
|
+----+----+
| |
+----------+ WAN +------------+
| | | |
| +---------+ |
| |
| |
+-----+-----+ +-----+-----+
| | | |
| monitor-1 | | monitor-3 |
| monitor-2 | | |
| | Fiber connection | |
| +---------------------+ |
| OSD-1 | | OSD-13 |
| OSD-2 | | OSD-14 |
| ... | | ... |
| OSD-12 | | OSD-24 |
| | | |
| client-a1 | | client-a2 |
| client-b1 | | client-b2 |
| | | |
+-----------+ +-----------+
Datacenter1 Datacenter2
(DC1) (DC2)
In DC1: 2 "OSD" nodes, each with 6 OSD daemons, one per disk.
Journals are on SSDs; there are 2 SSDs per node, so 3 journals per SSD.
In DC2: the same configuration.
You can imagine for instance that:
- client-a1 and client-a2 are radosgw
- client-b1 and client-b2 are web servers which use the CephFS of the cluster.
And of course, the principle is to have data replicated across DC1 and
DC2 (size == 2: one copy of each object in DC1, the other in DC2).
1. Assuming the latency between DC1 and DC2 (via the fiber
connection) is OK, I would like to know what throughput I need to
avoid a network bottleneck. Is there a rule to compute the needed
throughput? I suppose it depends on the disk throughputs?
For instance, suppose each OSD disk in DC1 (and in DC2) has
a throughput of 150 MB/s; with 12 OSD disks in each DC,
I have:
12 x 150 MB/s = 1800 MB/s, ie 1.8 GB/s, ie 14.4 Gbps
So the fiber link needs to sustain 14.4 Gbps. Is that correct? Or maybe
this reasoning is too naive?
Furthermore, I have not taken the SSD journals into account. How can I
evaluate the needed throughput more precisely?
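To double-check the arithmetic above, here is a small sketch of the worst-case calculation (the 150 MB/s per-disk figure and the 12-disk count are the numbers assumed in the question, not measured values); note the unit: 1800 MB/s is 14.4 Gbps, not 14.4 Mbps:

```python
def required_link_gbps(num_osd_disks: int, disk_mb_per_s: float) -> float:
    """Worst-case aggregate disk throughput in one DC, expressed in Gbps."""
    total_mb_per_s = num_osd_disks * disk_mb_per_s   # e.g. 12 * 150 = 1800 MB/s
    total_mbps = total_mb_per_s * 8                  # megabytes/s -> megabits/s
    return total_mbps / 1000                         # Mbps -> Gbps

print(required_link_gbps(12, 150))  # prints 14.4
```

This is only an upper bound, of course: it assumes all 12 disks stream at full speed simultaneously, which real workloads rarely do.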
2. I'm thinking about disaster recovery too. For instance, if there
is a disaster in DC2, DC1 will keep working (fine). But if there is a disaster
in DC1, DC2 will not work (no quorum, since only monitor-3 survives).
But now, suppose there is a long and serious disaster in DC1, so that
DC1 is totally unreachable. In this case, I want to start my Ceph
cluster manually in DC2. No problem with that; I have seen the procedure
in the documentation:
- I stop monitor-3
- I extract the monmap
- I remove monitor-1 and monitor-2 from this monmap
- I inject the new monmap into monitor-3
- I restart monitor-3
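For concreteness, the steps above map roughly onto the standard ceph-mon/monmaptool commands like this. This is only a sketch to be checked against the documentation for your Ceph version: the monitor id "monitor-3", the scratch path /tmp/monmap, and the sysvinit-style service commands are my assumptions for illustration.

```shell
# Sketch of the monmap surgery on the surviving monitor in DC2.
# Assumes the monitor id is "monitor-3"; adapt ids and paths to your setup.

service ceph stop mon.monitor-3                       # stop monitor-3

ceph-mon -i monitor-3 --extract-monmap /tmp/monmap    # extract the monmap

monmaptool /tmp/monmap --rm monitor-1 --rm monitor-2  # remove the DC1 monitors

ceph-mon -i monitor-3 --inject-monmap /tmp/monmap     # inject the new monmap

service ceph start mon.monitor-3                      # restart monitor-3
```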
After that, DC1 is unreachable but DC2 is working (with only one monitor).
But what happens if DC1 becomes reachable again? What will the behavior of
monitor-1 and monitor-2 be in this case? Will monitor-1 and monitor-2 understand
that they no longer belong to the Ceph cluster?
And now I imagine the worst scenario: DC1 becomes reachable again, but the
switch in DC1 which is connected to the fiber takes a long time to restart,
so that, during a short period, DC1 is reachable but the connection with DC2
is not yet operational. What happens during this period? client-a1 and client-b1
could write data to the cluster in this case, right? And the data in the
cluster could be corrupted, because DC1 is not aware of the writes in DC2.
Am I wrong?
My conclusion is: in case of a long disaster in DC1, I can restart
the Ceph cluster in DC2 with the method described above (removing monitor-1
and monitor-2 from the monmap in monitor-3, etc.) but *only* *if* I can
definitively stop monitor-1 and monitor-2 in DC1 first (and if I can't, I
do nothing and wait). Is that correct?
Thanks in advance for your explanations.
--
François Lafont
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com