There are multiple ways to accomplish active-active two-site synchronous DR, aka a "stretch cluster".
The most common approach is to have 3 sites: two main sites A and B, plus a tiebreaker site C. The two main sites host all data/metadata disks, and each has an equal number of quorum nodes. There is no stretched SAN; each site has its own set of NSDs defined. The tiebreaker site consists of a single quorum node with a small descOnly LUN. In this config, any of the 3 sites can go down or be disconnected from the rest without affecting the other two. The tiebreaker site is essential: it provides a quorum node for node majority quorum to function, and a descOnly disk for the file system descriptor quorum.

Technically speaking, one can do away with the need for a quorum node at site C by using "minority quorum", i.e. tiebreaker disks, but this model is more complex and it is harder to predict its behavior under various failure conditions. The basic problem with minority quorum is that it allows a minority of nodes to win in a network partition scenario, just as the name implies. In the extreme case this leads to the "dictator problem", where a single partitioned node could manage to win the disk election and thus kick everyone else out. And since a tiebreaker disk needs to be visible from all quorum nodes, you do need a stretched SAN that extends between sites. The classic active-active stretch cluster only requires a good TCP/IP network.

The question that gets asked a lot is "how good does the network connection between the sites need to be?" There's no simple answer, unfortunately; it would be impractical to try to frame this in simple thresholds. The worse the network connection is, the more pain it produces, but everyone has a different level of pain tolerance, and everyone's workload is different. In any GPFS configuration that uses data replication, writes are impacted far more by replication than reads. So a read-mostly workload may run fine over a dodgy inter-site link, while a write-heavy workload may run into the ground, as IOs are submitted faster than they can be completed. The buffering model can also make a big difference: an application that does a fair amount of write bursts, with those writes buffered in a generously sized pagepool, may perform acceptably, while an application that uses O_SYNC or O_DIRECT semantics for writes may run a lot worse, all other things being equal. As long as all nodes can renew their disk leases within the configured disk lease interval (35 sec by default), GPFS will basically work, so the absolute threshold for link quality is not particularly stringent; beyond that it all depends on your workload and your level of pain tolerance. Practically speaking, you want a link with low double-digit millisecond RTT at worst, almost no packet loss, and bandwidth commensurate with your application IO needs (fudged somewhat to allow for write amplification -- another factor that's entirely workload-dependent). A link with, say, 100 ms RTT and 2% packet loss is not going to be usable by almost anyone, in my opinion; a link with 30 ms RTT and 0.1% packet loss may work for some undemanding read-mostly workloads; and so on. You pretty much have to try it out to see.

The disk configuration is another tricky angle. The simplest approach is to have two groups of data/metadata NSDs, on sites A and B, and not have any sort of SAN reaching across sites. Historically, such a config was actually preferred over a stretched SAN, because it allowed for a basic definition of the site topology.
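To make this concrete, here is a minimal sketch of what the NSD stanzas and file system creation could look like for such a two-site-plus-tiebreaker layout, with one failure group per site so the two replicas land on different sites. The device paths, node names, NSD names, and file system name are hypothetical; verify the stanza fields and command options against the documentation for your release.

  # nsd.stanza -- site A disks in failure group 1, site B disks in failure group 2,
  # and the site C tiebreaker holding only a file system descriptor (descOnly)
  %nsd: device=/dev/mapper/lunA1
    nsd=nsd_siteA_1
    servers=nodeA1,nodeA2
    usage=dataAndMetadata
    failureGroup=1
    pool=system

  %nsd: device=/dev/mapper/lunB1
    nsd=nsd_siteB_1
    servers=nodeB1,nodeB2
    usage=dataAndMetadata
    failureGroup=2
    pool=system

  %nsd: device=/dev/mapper/lunC1
    nsd=nsd_siteC_desc
    servers=nodeC1
    usage=descOnly
    failureGroup=3

  # Create the NSDs, then the file system with default (and maximum)
  # replication of 2 for both data and metadata
  mmcrnsd -F nsd.stanza
  mmcrfs fs1 -F nsd.stanza -m 2 -M 2 -r 2 -R 2

With three failure groups, one per site, the file system descriptor quorum also has three copies to work with, which is what the small descOnly LUN at site C is for.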
When multiple replicas of the same logical block are present, it is obviously better/faster to read the replica that resides on a disk that is local to a given site. This is conceptually simple, but how would GPFS know what a site is and which disks are local vs remote? To GPFS, all disks are equal. Historically, the readReplicaPolicy=local config parameter was put forward to work around the problem. The basic idea was: if the reader node is on the same subnet as the primary NSD server for a given replica, that replica is "local" and is thus preferred. This sort of works, but requires a very specific network configuration, which isn't always practical. Starting with GPFS 4.1.1, GPFS implements readReplicaPolicy=fastest, where the best replica for reads is picked based on observed disk IO latency. This is more general and works for all disk topologies, including a stretched SAN.

yuri

From: "[email protected]" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 07/21/2016 05:45 AM
Subject: Re: [gpfsug-discuss] NDS in Two Site scenario
Sent by: [email protected]

This is where my confusion sits. So if I have two sites, and two NSD nodes per site with 1 NSD (to keep it simple), do I just present the physical LUN in Site1 to the Site1 NSD nodes and the physical LUN in Site2 to the Site2 NSD nodes? Or do I present the physical LUN in Site1 to all 4 NSD nodes, and the same at Site2? (Assuming SAN and not direct attached in this case.) I know I'm being persistent, but this for some reason confuses me.

Site1
  NSD Node1 ---NSD1 ---Physical LUN1 from SAN1
  NSD Node2

Site2
  NSD Node3 ---NSD2 ---Physical LUN2 from SAN2
  NSD Node4

Or

Site1
  NSD Node1 ---NSD1 ---Physical LUN1 from SAN1
            ---NSD2 ---Physical LUN2 from SAN2
  NSD Node2

Site2
  NSD Node3 ---NSD2 ---Physical LUN2 from SAN2
            ---NSD1 ---Physical LUN1 from SAN1
  NSD Node4

Site3
  Node5 Quorum

From: <[email protected]> on behalf of Ken Hill <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Wednesday, July 20, 2016 at 7:02 PM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] NDS in Two Site scenario

Yes - it is a cluster. The sites should NOT be further apart than a MAN or campus network. If you're looking to do this over a larger distance, it would be best to choose another GPFS solution (multi-cluster, AFM, etc).

Regards,

Ken Hill
Technical Sales Specialist | Software Defined Solution Sales, IBM Systems
Phone: 1-540-207-7270 | E-mail: [email protected]

From: "[email protected]" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 07/20/2016 07:33 PM
Subject: Re: [gpfsug-discuss] NDS in Two Site scenario
Sent by: [email protected]

So in this scenario Ken, can Server3 see any disks in Site1?

From: <[email protected]> on behalf of Ken Hill <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Wednesday, July 20, 2016 at 4:15 PM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] NDS in Two Site scenario

Site1                    Site2
Server1 (quorum 1)       Server3 (quorum 2)
Server2                  Server4

SiteX
Server5 (quorum 3)

You need to set up another site (or server) that is at least power isolated (if not completely infrastructure isolated) from Site1 and Site2. You would then set up a quorum node at that site/location. This ensures you can still access your data even if one of your sites goes down. You can further isolate failure by increasing the number of quorum nodes (odd numbers).
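A minimal sketch of the corresponding commands, assuming the hypothetical node names in the layout above: the quorum nodes are designated with mmchnode, and the read-replica behavior Yuri describes earlier in the thread is selected with mmchconfig.

  # Designate one quorum node per site (three quorum nodes total)
  mmchnode --quorum -N Server1,Server3,Server5

  # Verify the node designations
  mmlscluster

  # Prefer the replica with the lowest observed disk I/O latency (GPFS 4.1.1 and later)
  mmchconfig readReplicaPolicy=fastest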
The way quorum works is: the majority of the quorum nodes need to be up to survive an outage.

- With 3 quorum nodes you can have 1 quorum node failure and continue file system operations.
- With 5 quorum nodes you can have 2 quorum node failures and continue file system operations.
- With 7 quorum nodes you can have 3 quorum node failures and continue file system operations.
- etc.

Please see http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/ibmspectrumscale42_content.html?view=kc for more information about quorum and tiebreaker disks.

Ken Hill
Technical Sales Specialist | Software Defined Solution Sales, IBM Systems

From: "[email protected]" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 07/20/2016 04:47 PM
Subject: [gpfsug-discuss] NDS in Two Site scenario
Sent by: [email protected]

For some reason this concept is a round peg that doesn't fit the square hole inside my brain. Can someone please explain the best practice for setting up two sites in the same cluster? I get that I would likely have two NSD nodes in site 1 and two NSD nodes in site 2. What I don't understand are the failure scenarios, and what would happen if I lose one node or, worse, a whole site goes down. Do I solve this by having Scale replication set to 2 for all my files? I mean, a single site I think I get; it's when there are two datacenters and I don't want two clusters, typically.

Mark R. Bush | Solutions Architect | Sirius Computer Solutions
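On the "replication set to 2" question above, a rough sketch of how the file system defaults might be checked and changed (the file system name fs1 is hypothetical, and the options should be verified against the documentation for your release):

  # Show the current default metadata (-m) and data (-r) replication factors
  mmlsfs fs1 -m -r

  # Raise the defaults to 2 (requires the maximums set at mmcrfs time, -M/-R, to be at least 2),
  # then re-replicate existing files to match the new defaults
  mmchfs fs1 -m 2 -r 2
  mmrestripefs fs1 -R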
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
