Hi Nils,

Replies inline.
On Mon, Aug 25, 2008 at 1:19 PM, Nils Goroll <slink at schokola.de> wrote:

> (I've posted my reply to the forum via web to avoid moderator delay)
>
> Hi Tirthankar,
>
>>> let_partition_wait is to return true if the node running it is in
>>> the "smaller" partition, right?
>>
>> Yes, where the definition of "smaller" changes according to the number
>> of nodes configured, i.e. if n is the number of nodes configured, a
>> smaller partition may be much less than n/2.
>> [...]
>> In your code, the definition of "large" is fixed. In my code, the
>> definition of "small" is variable.
>
> Agree. But I would prefer a closed form as the definition of what is
> considered a small/large partition. The numbers you have chosen seem
> arbitrary to a certain extent, and I believe it would be hard to show
> that exactly those numbers are a good choice. If you could come up with
> a formula and some good reasoning behind it, this should be much easier
> to follow.

Yes, to a large extent the numbers are arbitrary, but they do have a logic: they are chosen to withstand a second or third failure and to provide maximum availability with minimal reconfiguration time. I do not have the historical data to find the best values; the numbers can be tweaked in the future if someone provides actual data on what would be more optimal. For this bug fix, finding the optimal numbers is out of scope; it would be a project by itself. As with any heuristic algorithm, I have tried to come close to an optimal solution, not necessarily the optimal one.

>> Now we will have 3 partitions and we do not want to delay the partition
>> which has the bare minimal acceptable number of nodes, which differs
>> depending on the number of nodes configured.
>
> What happens if the remaining partition does not have the minimum number
> of nodes? How long will the delay be? Have you tested the scenario where
> a cluster has only "small" partitions left?
The delay is the max path timeout, which is 10 seconds by default. So if no partition has the minimum number of nodes, all the partitions wait for the path timeout and then proceed. We delay all the partitions because small partitions generally cannot provide availability in case of future failures, and we take the optimistic approach that, before the timeout expires, new nodes will join the smaller partitions to form a bigger partition, one big enough to go ahead without waiting for the other partitions.

>>> What happens if there is no larger partition, for instance if nodes
>>> were taken down administratively?
>>
>> A node that is taken down for an administrative function is no longer
>> part of the cluster. Hence there is no issue.
>
> The documented procedure
> http://docs.sun.com/app/docs/doc/819-2971/z4000076997776?l=en&a=view
> is to evacuate a node for maintenance. Unconfiguring it would be too
> much of a burden for the admin, IMHO.

Yes, as the doc says, "Occasionally, when patching a node with a Sun Cluster patch, you might need to temporarily remove a node from cluster membership or stop the entire cluster before installing the patch." During such an operation the node is not considered part of the cluster. Let me put it in simpler words with an example: you have a 4-node cluster and want to install a patch on one node. You just reboot that node in non-cluster mode and install the patch. While the node is booted in non-cluster mode, it is not part of the running cluster.

There are two memberships:

1. Static membership. This information is stored in the Cluster Configuration Repository (CCR). Configuring or unconfiguring a node changes the static membership/CCR. This bug fix does not deal with the static membership. The quote I gave above talks about dynamic membership.

2. Dynamic membership. This is calculated at run time by a module called the Cluster Membership Monitor (CMM). This bug fix is related to the CMM.
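To make the wait heuristic described earlier in this reply concrete, here is a minimal sketch. The names and thresholds are illustrative stand-ins, not the actual CMM code: `let_partition_wait` comes from this thread, but `min_acceptable_size`, its numbers, and `PATH_TIMEOUT_SECS` are hypothetical choices for the sketch only.

```python
# Hypothetical sketch of the wait decision discussed in this thread.
# The real CMM tunables and threshold numbers are not reproduced here.

PATH_TIMEOUT_SECS = 10  # default max path timeout mentioned above


def min_acceptable_size(n_configured):
    """Illustrative variable threshold: the smallest partition that may
    proceed without waiting. For small clusters this can be well below
    a strict majority; the actual numbers in the fix are heuristic."""
    if n_configured <= 2:
        return 1
    return max(2, n_configured // 3)


def let_partition_wait(n_configured, partition_size):
    """Return True if this partition is 'small' and should wait up to
    PATH_TIMEOUT_SECS for more nodes to join before proceeding."""
    return partition_size < min_acceptable_size(n_configured)


def old_majority_rule_waits(n_configured, partition_size):
    """For contrast: the previous hard-coded behavior, where anything
    below a strict majority (n/2 + 1) waited."""
    return partition_size < n_configured // 2 + 1
```

Under this sketch, if every partition is "small", each one waits out the path timeout and then proceeds; if joining nodes push a waiting partition past the threshold before the timeout expires, it can go ahead immediately.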
We are trying to decide on the dynamic cluster membership here. Hence, for maintenance purposes, only the dynamic membership gets changed, by booting the node out of cluster mode.

> So, unless I don't know about new functionality, there is no state
> information available which marks a cluster node as "not available". In
> the partitioning scenario, we must assume a node is offline if we cannot
> communicate with it.

This is not new functionality; it has always existed. Though I must agree it is a bit confusing to understand and differentiate between the dynamic and the static membership of a cluster.

> So in short, IMHO there is currently no practical way to reliably
> determine the total cluster size in a partitioning situation.

That is true, and that is why I am applying a heuristic-based solution. I am sure this is going to change in the future when people really start running big clusters and we get more data.

> Am I wrong? I'd be glad if I was and if I am, please help me understand.
>
> If I am right, it could help to add a node property indicating whether
> or not the node is available. IMHO, this would also help administrators
> in handling defective hardware, test scenarios, node-local s/w issues
> etc.

This information is already present; the "clnode status" command provides it:

pocho1 @ / $ clnode status

=== Cluster Nodes ===

--- Node Status ---

Node Name        Status
---------        ------
pocho1           Online
pocho3           Online
pocho4           Online
pocho2           Online

In the example above, the static membership is the left column, which lists the configured node names, and the dynamic membership is the right column, which says whether the node is currently in cluster membership.

>> What is implicit is that most clusters are of 4 nodes, hence the logic
>> works for most of the cases.
>
> I disagree with using such an assumption as the basis for a particular
> implementation.
> Might be that Sun has statistics internally about cluster sizes deployed
> in the field, but as long as the product is supported for other sizes as
> well, it should work well for all of them.

The current implementation will work for all sizes. But yes, there could be better numbers, and finding them is well out of scope for this bug fix; as I said, it is a project by itself.

>> This is a heuristic algorithm that I am trying to apply. Hence, as with
>> any heuristic algorithm, it tries to solve the problem by coming very
>> close to the best possible solution.
>
> I do agree with this approach, and I understand that your change will
> improve a particular scenario. I am only worried that it could have
> negative effects in other scenarios. As I have seen enough of those in
> the past, I'd be grateful if you could clarify my remaining questions.

My change makes all scenarios work better. Previously the threshold was hard-coded to n/2 + 1; this change makes it variable. A lot of changes have gone in over the last 2 years to improve the reconfiguration time, and I do not see any case where this addition will make cluster reconfiguration slower than the previous implementation. If you have a particular case in mind, let me know.

> Nils

--
Tirthankar
http://insanityrulz.blogspot.com/
