On 8/29/19 9:46 AM, Dušan Maček wrote:
> Dear All,
>
> I need some advice regarding cluster system update. I've built a
> cluster in a hope of zero downtime, but unfortunately it doesn't work
> this way.
Actual "zero downtime" is unrealistic anyway, especially with COTS
hardware and software. The only systems that come somewhat close to
achieving no downtime at all are custom-designed hardware/software
combinations that are highly specialized for the job.

A cluster of general-purpose hardware and software helps you keep
downtime short in many scenarios - e.g., hardware failure, OS crash,
most software crashes, things like that. Some scenarios remain where the
cluster will not be able to prevent downtime, typically when a service
stops working correctly but its process does not terminate, so to
Pacemaker's monitoring the process appears to be running normally. Such
scenarios require operator intervention, and the downtime is determined
by how quickly the operators can diagnose and fix the problem.
Pre-defined standard operating procedures and a highly trained 24/7
staff of system operators can help reduce downtime in those cases.

To summarize: high availability is not simply a product that can be
installed; it is rather the result of using the right tools for the job,
putting the right processes in place, and ensuring 24/7 availability of
highly trained system operators.

> Now the main question : what are your experiences with system upgrades
> in cluster environment ? How to avoid downtime ?

The typical process for upgrading software on a cluster is to upgrade
the hardware or software on the standby system, make sure that the
standby system is able to take over (e.g., is resynced with the active
system), and then migrate the cluster resources to the standby system
so that the other node can be upgraded. The migration itself, however,
causes a short downtime, because most services need to be stopped on
the active node and restarted on the standby node. That's what most
people do, and in our experience it works well in most cases.
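That procedure can be sketched as a shell dry-run. Everything here is an
assumption for illustration: the node names node1/node2, the pcs cluster
shell, yum as the package manager, and DRBD's drbdadm for checking the
resync state - adapt it to your stack. With DRY_RUN=1 the script only
prints the commands instead of executing them:

```shell
#!/bin/sh
# Rolling-upgrade sketch for a two-node Pacemaker/DRBD cluster.
# All names (node1, node2) and package names are illustrative
# assumptions; this is not a drop-in script.
DRY_RUN=1
PLAN=""

# In dry-run mode, record and print each command; otherwise execute it.
run() {
    PLAN="$PLAN$*
"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Upgrade the standby node (node2) first.
run ssh node2 "yum -y update pacemaker corosync drbd-utils"

# 2. Make sure node2 is able to take over, e.g. DRBD is resynced.
run ssh node2 "drbdadm status"

# 3. Migrate resources to node2 by putting node1 in standby.
#    This is the short downtime: services stop on node1 and
#    restart on node2.
run pcs node standby node1

# 4. Upgrade the former active node, then bring it back into the
#    cluster as the new standby.
run yum -y update pacemaker corosync drbd-utils
run pcs node unstandby node1
```

Note that "pcs node standby" is the pcs 0.10+ spelling; older pcs
releases spell it "pcs cluster standby", and with crmsh the equivalent
is "crm node standby". Check the tools shipped with your distribution.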
You will, however, lose high availability during the upgrade process,
because your standby system will not be ready to run services while you
are upgrading it. If high availability must be maintained even while
upgrading, then at least a 3-node cluster is required, and your software
must be both backward and forward compatible: once you begin running the
upgraded software, your only standby nodes still have the old software,
so if your upgraded node fails before you had a chance to upgrade
another node, you have another downtime if the old software cannot take
over from the new version.

To prevent that scenario too, you would need at least a 4-node cluster,
so that 2 nodes can provide HA services while you are upgrading the
other 2 nodes, after which you can fail over to the 2 upgraded nodes.
Ideally, the cluster would also be able to rely on quorum for
decision-making at all times, so a cluster of at least 5 nodes would be
even better.

This is about as close to zero downtime as you can get with COTS
hardware and software:

- Run a 5-node cluster
- Make use of quorum
- Have properly working fencing mechanisms
- Upgrade only one node at a time
- Upgrade two (or three) nodes before failing over to a node with
  upgraded software
- Then upgrade the remaining nodes one by one until all nodes are
  upgraded

There will still be a short downtime during the failover. That being
said, the biggest risk of downtime at this point is probably operator
error, due to the high complexity of such an upgrade process (e.g.,
managing the cluster correctly during the upgrade).

br,
Robert

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
[email protected]
https://lists.linbit.com/mailman/listinfo/drbd-user
