Hi Christian,

Thank you very much for the reply. Please find my comments in-line.
Thanks & Regards,
Manoj

On Sun, Aug 7, 2016 at 3:26 PM, Christian Balzer <[email protected]> wrote:
>
> [Reduced to ceph-users, this isn't community related]
>
> Hello,
>
> On Sat, 6 Aug 2016 20:23:41 +0530 Venkata Manojawa Paritala wrote:
>
> > Hi,
> >
> > We have configured a single Ceph cluster in a lab with the below
> > specification.
> >
> > 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC).
> > This is to simulate that the nodes are part of different data centers
> > and have network connectivity between them for DR.
>
> You might want to search the ML archives, this has been discussed plenty
> of times.
> While DR and multi-site replication certainly are desirable, they are
> also going to introduce painful latencies with Ceph, especially if your
> sites aren't relatively close to each other (Metro, less than 10km fiber
> runs).
>
Manoj :- We have configured the delays on the ethernet ports. Between
sites A & B we have a 0.2 ms delay (configured on SiteB). Between sites
B & C we have a delay of 5 ms (configured on SiteC).

> The new rbd-mirror feature may or may not help in this kind of scenario,
> see the posts about this just in the last few days.
>
> Since you didn't explicitly mention it, you do have custom CRUSH rules
> to distribute your data accordingly?
>
Manoj :- You guessed it right. We have configured rulesets in such a way
that OSDs from all 3 sites are picked for replication.

> > 2. Each site operates in a different subnet and each subnet is part of
> > one VLAN. We have configured routing so that OSD nodes in one site can
> > communicate with OSD nodes in the other 2 sites.
> > 3. Each site will have one monitor node, 2 OSD nodes (to which we have
> > disks attached) and IO-generating clients.
>
> You will want more monitors in a production environment and, depending
> on the actual topology, more "sites" to break ties.
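The site-spanning ruleset mentioned above could look roughly like the
crushmap sketch below. This is an illustration only: the rule name, the
ruleset id, and the use of a `datacenter` bucket type for the three sites
are assumptions, not taken from the actual cluster.

```
# Hypothetical crushmap rule: place one replica in each of the 3 sites.
# Names and ids are illustrative, not from the cluster in this thread.
rule replicated_3site {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default
        step chooseleaf firstn 0 type datacenter
        step emit
}
```

With pool size=3, `chooseleaf firstn 0 type datacenter` selects one OSD
under each of three distinct datacenter buckets, which is consistent with
the "OSDs from all 3 sites" behavior described in the thread.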
> For example if you have a triangle setup, give your primary site 3 MONs
> and the other sites 2 MONs each.
> Of course this means if you lose all network links between your sites,
> you still won't be able to reach quorum.
>
Manoj :- Ok.

> > 4. We have configured 2 networks.
> > 4.1. Public network - to which all the clients, monitors and OSD nodes
> > are connected.
> > 4.2. Cluster network - to which only the OSD nodes are connected, for
> > replication/recovery/heartbeat traffic.
>
> Unless actually needed, I (and others) tend to avoid split networks,
> since they can introduce "wonderful" failure scenarios, as you just
> found out.
> The only reason for such a split network setup in my book is if your
> storage nodes can write FASTER than the aggregate bandwidth of your
> network links to those nodes.
>
Manoj :- We did not want the replication/recovery/heartbeat traffic on
the public network, so we configured a separate network for them.

> > 5. We have 2 issues here.
> > 5.1. We are unable to sustain IO for clients from individual sites
> > when we isolate the OSD nodes by bringing down ONLY the cluster
> > network between sites. Logically this will put the individual sites
> > in isolation with respect to the cluster network. Please note that
> > the public network is still connected between the sites.
>
> See above, that's expected.
> Though in a real-world setup I'd expect both networks to fail (common
> fiber trunk being severed) at the same time.
> Again, instead of 2 networks you'll be better off with a single, but
> fully redundant, network.
>
Manoj :- You mean to say only one network (public) with 2 NICs on each of
the monitor & OSD nodes?

> > 5.2. In a fully functional cluster, when we bring down 2 sites (shut
> > down the OSD services of 2 sites - say Site A OSDs and Site B OSDs),
> > then OSDs in the third site (Site C) are going down (OSD flapping).
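On the single-but-redundant-network suggestion above, the ceph.conf side
of such a setup is a matter of omission rather than addition. A minimal
sketch, assuming the public subnet quoted later in this thread (NIC
redundancy itself, e.g. bonding two NICs per node, would be handled at
the OS and switch level, not in ceph.conf):

```
# Sketch: single-network variant - no cluster_network line, so
# replication, recovery and heartbeat traffic all use the public network.
[global]
public_network = 10.10.0.0/16
```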
> This is a bit unclear; if you only shut down the OSDs, and the MONs are
> still running and have connectivity, the cluster should still have a
> working quorum (the thing you're thinking about below).
>
> OTOH, losing 2/3rds of your OSDs with normal (min_size=2) replication
> settings will lock your cluster up anyway.
>
Manoj :- This was what we were guessing. Also, we observed the same issue
when we tried with a single replica.

> Regards,
> Christian

> > We need workarounds/solutions to fix the above 2 issues.
> >
> > Below are some of the parameters we have already set in ceph.conf to
> > sustain the cluster for a longer time when we cut off the links
> > between sites. But they were not successful.
> >
> > --------------
> > [global]
> > public_network = 10.10.0.0/16
> > cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
> > osd heartbeat address = 172.16.0.0/16
> >
> > [mon]
> > mon osd report timeout = 1800
> >
> > [osd]
> > osd heartbeat interval = 12
> > osd heartbeat grace = 60
> > osd mon heartbeat interval = 60
> > osd mon report interval max = 300
> > osd mon report interval min = 10
> > osd mon ack timeout = 60
> > .
> > .
> > ----------------
> >
> > We also configured the parameter "osd_heartbeat_addr" and tried two
> > values: 1) the Ceph public network (assuming that when we bring down
> > the cluster network, heartbeats would happen via the public network);
> > 2) a different network range altogether, with physical connections.
> > But neither option worked.
> >
> > We have a total of 49 OSDs (14 in SiteA, 14 in SiteB, 21 in SiteC) in
> > the cluster, and one monitor in each site.
> >
> > We need to try the below two options.
> >
> > A) Increase the "mon osd min down reporters" value. The question is
> > by how much. Say, if I set this value to 49, will client IO sustain
> > when we cut off the cluster network links between sites?
> > In this case one issue would be that if the OSD is really down, we
> > wouldn't know.
> >
> > B) Add 2 monitors to each site. This would give each site 3 monitors
> > and the overall cluster 9 monitors. The reason we wanted to try this
> > is that we think the OSDs are going down because the quorum is unable
> > to find the minimum number of nodes (maybe monitors) to sustain.
> >
> > Thanks & Regards,
> > Manoj
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]        Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
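The arithmetic behind both options can be checked quickly. This is a
sketch of the reasoning only: the majority rule for monitor quorum is
standard Paxos behavior, while the reporter threshold is a rule of thumb
(an assumption, not an official Ceph formula) for stopping any single
isolated site from getting the other sites' OSDs marked down.

```python
# OSD counts per site, as stated in the thread.
site_osds = {"SiteA": 14, "SiteB": 14, "SiteC": 21}

def min_down_reporters(sites):
    """Option A (rule of thumb, assumption): the smallest value of
    'mon osd min down reporters' that no single site can reach on its
    own, i.e. one more than the largest site's OSD count."""
    return max(sites.values()) + 1

def quorum_size(n_mons):
    """Option B: monitors needed for quorum (a strict majority)."""
    return n_mons // 2 + 1

print(min_down_reporters(site_osds))  # 22: SiteC's 21 OSDs can't reach it alone
print(quorum_size(9))                 # 5: quorum with 3 MONs per site, 9 total
print(3 >= quorum_size(9))            # False: an isolated site has no quorum
```

This matches Christian's earlier point: with all inter-site links down,
no single site's 3 monitors can reach the 5-monitor majority, so option B
alone would not keep an isolated site writable. And raising the reporter
count to 22 (let alone 49) means a genuinely dead OSD, reported only by
the peers in its own site, would never be marked down.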
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
