Hi Christian,

Thank you very much for the reply. Please find my comments in-line.
Thanks & Regards,
Manoj

On Sun, Aug 7, 2016 at 3:26 PM, Christian Balzer <[email protected]> wrote:
>
> [Reduced to ceph-users, this isn't community related]
>
> Hello,
>
> On Sat, 6 Aug 2016 20:23:41 +0530 Venkata Manojawa Paritala wrote:
>
> > Hi,
> >
> > We have configured a single Ceph cluster in a lab with the below
> > specification.
> >
> > 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC).
> > This is to simulate that the nodes are part of different data centers
> > and have network connectivity between them for DR.
>
> You might want to search the ML archives, this has been discussed plenty
> of times.
> While DR and multi-site replication certainly are desirable, they are
> also going to introduce painful latencies with Ceph, especially if your
> sites aren't relatively close to each other (Metro, less than 10km fiber
> runs).
>
Manoj :- We have configured the delays on the ethernet ports. Between
sites A & B we have a 0.2 ms delay (configured on SiteB). Between sites
B & C we have a delay of 5 ms (configured on SiteC).

> The new rbd-mirror feature may or may not help in this kind of scenario,
> see the posts about this just in the last few days.
>
> Since you didn't explicitly mention it, you do have custom CRUSH rules
> to distribute your data accordingly?
>
Manoj :- You guessed it right. We have configured rulesets in such a way
that OSDs from all 3 sites are picked for replication.

> > 2. Each site operates in a different subnet and each subnet is part of
> > one VLAN. We have configured routing so that OSD nodes in one site can
> > communicate with OSD nodes in the other 2 sites.
> > 3. Each site will have one monitor node, 2 OSD nodes (to which we have
> > disks attached) and IO-generating clients.
>
> You will want more monitors in a production environment and, depending
> on the actual topology, more "sites" to break ties.
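The site-spanning ruleset mentioned above could look roughly like the
crushmap sketch below. This is an illustration only: the rule name, the
ruleset id, and the use of a `datacenter` bucket type for the three sites
are assumptions, not taken from the actual cluster.

```
# Hypothetical crushmap rule: place one replica in each of the 3 sites.
# Names and ids are illustrative, not from the cluster in this thread.
rule replicated_3site {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default
        step chooseleaf firstn 0 type datacenter
        step emit
}
```

With pool size=3, `chooseleaf firstn 0 type datacenter` selects one OSD
under each of three distinct datacenter buckets, which is consistent with
the "OSDs from all 3 sites" behavior described in the thread.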
> For example if you have a triangle setup, give your primary site 3 MONs
> and the other sites 2 MONs each.
> Of course this means if you lose all network links between your sites,
> you still won't be able to reach quorum.
>
Manoj :- Ok.

> > 4. We have configured 2 networks.
> > 4.1. Public network - to which all the clients, monitors and OSD nodes
> > are connected.
> > 4.2. Cluster network - to which only the OSD nodes are connected, for
> > replication/recovery/heartbeat traffic.
>
> Unless actually needed, I (and others) tend to avoid split networks,
> since they can introduce "wonderful" failure scenarios, as you just
> found out.
> The only reason for such a split network setup in my book is if your
> storage nodes can write FASTER than the aggregate bandwidth of your
> network links to those nodes.
>
Manoj :- We did not want the replication/recovery/heartbeat traffic on
the public network, so we configured a separate network for them.

> > 5. We have 2 issues here.
> > 5.1. We are unable to sustain IO for clients from individual sites
> > when we isolate the OSD nodes by bringing down ONLY the cluster
> > network between sites. Logically this will put the individual sites
> > in isolation with respect to the cluster network. Please note that
> > the public network is still connected between the sites.
>
> See above, that's expected.
> Though in a real-world setup I'd expect both networks to fail (common
> fiber trunk being severed) at the same time.
> Again, instead of 2 networks you'll be better off with a single, but
> fully redundant, network.
>
Manoj :- You mean to say only one network (public) with 2 NICs on each of
the monitor & OSD nodes?

> > 5.2. In a fully functional cluster, when we bring down 2 sites (shut
> > down the OSD services of 2 sites - say Site A OSDs and Site B OSDs),
> > then OSDs in the third site (Site C) are going down (OSD flapping).
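On the single-but-redundant-network suggestion above, the ceph.conf side
of such a setup is a matter of omission rather than addition. A minimal
sketch, assuming the public subnet quoted later in this thread (NIC
redundancy itself, e.g. bonding two NICs per node, would be handled at
the OS and switch level, not in ceph.conf):

```
# Sketch: single-network variant - no cluster_network line, so
# replication, recovery and heartbeat traffic all use the public network.
[global]
public_network = 10.10.0.0/16
```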
> This is a bit unclear; if you only shut down the OSDs, and the MONs are
> still running and have connectivity, the cluster should still have a
> working quorum (the thing you're thinking about below).
>
> OTOH, losing 2/3rds of your OSDs with normal (min_size=2) replication
> settings will lock your cluster up anyway.
>
Manoj :- This was what we were guessing. Also, we observed the same issue
when we tried with a single replica.

> Regards,
> Christian

> > We need workarounds/solutions to fix the above 2 issues.
> >
> > Below are some of the parameters we have already set in ceph.conf to
> > sustain the cluster for a longer time when we cut off the links
> > between sites. But they were not successful.
> >
> > --------------
> > [global]
> > public_network = 10.10.0.0/16
> > cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
> > osd heartbeat address = 172.16.0.0/16
> >
> > [mon]
> > mon osd report timeout = 1800
> >
> > [osd]
> > osd heartbeat interval = 12
> > osd heartbeat grace = 60
> > osd mon heartbeat interval = 60
> > osd mon report interval max = 300
> > osd mon report interval min = 10
> > osd mon ack timeout = 60
> > .
> > .
> > ----------------
> >
> > We also configured the parameter "osd_heartbeat_addr" and tried two
> > values: 1) the Ceph public network (assuming that when we bring down
> > the cluster network, heartbeats would happen via the public network);
> > 2) a different network range altogether, with physical connections.
> > But neither option worked.
> >
> > We have a total of 49 OSDs (14 in SiteA, 14 in SiteB, 21 in SiteC) in
> > the cluster, and one monitor in each site.
> >
> > We need to try the below two options.
> >
> > A) Increase the "mon osd min down reporters" value. The question is
> > by how much. Say, if I set this value to 49, will client IO sustain
> > when we cut off the cluster network links between sites?
> > In this case one issue would be that if the OSD is really down, we
> > wouldn't know.
> >
> > B) Add 2 monitors to each site. This would give each site 3 monitors
> > and the overall cluster 9 monitors. The reason we wanted to try this
> > is that we think the OSDs are going down because the quorum is unable
> > to find the minimum number of nodes (maybe monitors) to sustain.
> >
> > Thanks & Regards,
> > Manoj
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]        Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
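The arithmetic behind both options can be checked quickly. This is a
sketch of the reasoning only: the majority rule for monitor quorum is
standard Paxos behavior, while the reporter threshold is a rule of thumb
(an assumption, not an official Ceph formula) for stopping any single
isolated site from getting the other sites' OSDs marked down.

```python
# OSD counts per site, as stated in the thread.
site_osds = {"SiteA": 14, "SiteB": 14, "SiteC": 21}

def min_down_reporters(sites):
    """Option A (rule of thumb, assumption): the smallest value of
    'mon osd min down reporters' that no single site can reach on its
    own, i.e. one more than the largest site's OSD count."""
    return max(sites.values()) + 1

def quorum_size(n_mons):
    """Option B: monitors needed for quorum (a strict majority)."""
    return n_mons // 2 + 1

print(min_down_reporters(site_osds))  # 22: SiteC's 21 OSDs can't reach it alone
print(quorum_size(9))                 # 5: quorum with 3 MONs per site, 9 total
print(3 >= quorum_size(9))            # False: an isolated site has no quorum
```

This matches Christian's earlier point: with all inter-site links down,
no single site's 3 monitors can reach the 5-monitor majority, so option B
alone would not keep an isolated site writable. And raising the reporter
count to 22 (let alone 49) means a genuinely dead OSD, reported only by
the peers in its own site, would never be marked down.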
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
