Hi Nils,

Replies inline.

On Mon, Aug 25, 2008 at 1:19 PM, Nils Goroll <slink at schokola.de> wrote:

> (I've posted my reply to the forum via web to avoid moderator delay)
>
> Hi Tirthankar,
>
>     let_partition_wait is to return true if the node running it is in
>>    the "smaller" partition, right?
>>
>> Yes where the definition of "smaller" changes according to the number of
>> nodes configured. i.e. if n is the number of nodes configured, a smaller
>> partition may be much less than n/2.
>> [...]
>> In your code, the definition of "large" is fixed. In my code, the
>> definition of "small" is variable.
>>
>
> Agree. But I would prefer a closed form as the definition of what is
> considered a small/large partition. The numbers you have chosen seem
> arbitrary to a certain extent and I believe it would be hard to show that
> exactly those numbers are a good choice. If you could come up with a formula
> and some good reasoning behind it, this should be much easier to follow.
>

Yes, to a large extent the numbers are arbitrary, but they do have a logic:
they are chosen to withstand a second or third failure and to provide
maximum availability with minimal reconfiguration time. I do not have the
historical data to find out what the best case is. The numbers can be
tweaked in the future if someone provides actual data on what would be more
optimal. As of now, for this bug fix, finding the optimal numbers would be
out of scope; it would become a project by itself.

As with any heuristic algorithm, I have tried to come close to an optimal
solution, not necessarily the optimal one.
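To show the shape of that logic, here is a minimal sketch. The thresholds
below are placeholders, not the numbers in the actual fix; the point is only
that the cut-off for "small" is a function of the configured node count
instead of being fixed at n/2 + 1:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the variable "small partition" test.
 * The real fix uses different hand-tuned thresholds; these are
 * placeholders to illustrate that the cut-off shrinks (relative
 * to n) as the configured cluster grows.
 */
static bool
let_partition_wait(int partition_size, int configured_nodes)
{
    int threshold;

    if (configured_nodes <= 4)
        threshold = configured_nodes / 2 + 1;  /* classic majority */
    else
        threshold = configured_nodes / 3 + 1;  /* much less than n/2 */

    return partition_size < threshold;  /* small partitions wait */
}
```

With these placeholder numbers, a 3-node partition out of 4 configured nodes
would proceed immediately, while the same 3 nodes out of 9 configured would
also proceed, since the threshold there is only 4.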


> > Now we will have 3 partitions and we do not want to delay
>
>> the partition which has the bare minimal acceptable number of nodes, which
>> differs depending on the number of nodes configured.
>>
>
> What happens if the remaining partition does not have the minimum number of
> nodes? How long will the delay be? Have you tested the scenario where a
> cluster has only "small" partitions left?


The delay is the max path timeout, which is 10 seconds by default. Hence if
no partition has the minimum number of nodes, all the partitions will wait
for the path timeout and then proceed. We delay all partitions because we
think the small partitions cannot provide availability in case of future
failures, and we take the optimistic approach that before the timeout
expires, new nodes will join the smaller partitions to form a bigger
partition, big enough to go ahead without waiting for the other partitions.
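The waiting behavior can be sketched like this. This is an illustrative
simulation only (the function name, and observing the partition size once
per second, are my simplifications, not the actual CMM code):

```c
#define MAX_PATH_TIMEOUT_SECS 10  /* default max path timeout */

/*
 * Illustrative only: how long a too-small partition waits.
 * sizes[t] is the partition size observed at second t; threshold
 * is the minimum acceptable size.  The partition proceeds early
 * the moment enough nodes join, otherwise after the full timeout.
 */
static int
partition_wait_secs(const int sizes[], int threshold)
{
    for (int t = 0; t < MAX_PATH_TIMEOUT_SECS; t++) {
        if (sizes[t] >= threshold)
            return t;               /* enough nodes joined early */
    }
    return MAX_PATH_TIMEOUT_SECS;   /* waited out the full timeout */
}
```

So a partition that never grows waits the full 10 seconds and then proceeds
anyway, while one that reaches the threshold at second 2 proceeds right then.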



>
>     What happens if there is no larger partition, for instance if nodes
>>    were taken down administratively
>>
>> Node that is taken down for administrative function is no more a part of
>> the cluster. Hence there is no issue.
>>
>
> The documented procedure
> http://docs.sun.com/app/docs/doc/819-2971/z4000076997776?l=en&a=view
> is to evacuate a node for maintenance. Unconfiguring it would be too much
> of burden for the admin, IMHO.


Yes, as the doc says:
"Occasionally, when patching a node with a Sun Cluster patch, you might need
to temporarily remove a node from cluster membership or stop the entire
cluster before installing the patch."

During such an operation the node is not considered part of the cluster.

I will try to put it in simpler words; note this is just an example. You
have a 4-node cluster and you want to install a patch on one node. You just
reboot the node in non-cluster mode and install the patch. While the node is
booted in non-cluster mode, it is not part of the running cluster.

There are 2 memberships.
1. Static membership. This information is stored in the Cluster
Configuration Repository (CCR). Unconfiguring or configuring a node changes
the static membership/CCR. This bug fix does not deal with the static
membership. The quote I gave above talks about dynamic membership.

2. Dynamic membership. This is calculated at run time by a module called
the Cluster Membership Monitor (CMM). This bug fix is related to the CMM:
we are trying to decide on the dynamic cluster membership here. Hence for
maintenance purposes, only the dynamic membership is changed, by booting
the node out of cluster mode.
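A toy model may make the distinction clearer. These structures are purely
illustrative, not the actual CCR or CMM data structures:

```c
#include <stdbool.h>

/* Illustrative model only; not actual CCR/CMM structures. */
struct node_state {
    const char *name;
    bool configured;  /* static membership, from the CCR */
    bool online;      /* dynamic membership, computed by the CMM */
};

/* Count the nodes currently in the dynamic cluster membership. */
static int
dynamic_members(const struct node_state nodes[], int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (nodes[i].configured && nodes[i].online)
            count++;
    return count;
}
```

Rebooting one node in non-cluster mode flips only its online flag: the
static membership stays at 4 nodes while the dynamic membership drops to 3.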


>
> So, unless I don't know about new functionality, there is no state
> information available which marks a cluster node as "not available". In the
> partitioning scenario, we must assume a node is offline if we cannot
> communicate with it.
>

This is not new functionality; it has always existed. Though I must agree it
is a bit confusing to understand and differentiate between the dynamic and
static membership of a cluster.


> So in short, IMHO there is currently no practical way to reliably determine
> the total cluster size in a partitioning situation.


That is true. That is why I am applying a heuristic-based solution. I am
sure this will change in the future when people really start running big
clusters and we get more data.


> Am I wrong? I'd be glad if I was and if I am, please help me understand.
>




>
> If I am right, it could help to add a node property indicating whether or
> not the node is available. IMHO, this would also help administrators in
> handling defective hardware, test scenarios, node-local s/w issues etc.
>

This information is already present; the "clnode status" command provides
it.

pocho1 @ / $ clnode status

=== Cluster Nodes ===

--- Node Status ---

Node Name                                       Status
---------                                       ------
pocho1                                          Online
pocho3                                          Online
pocho4                                          Online
pocho2                                          Online


In the above example, the static membership is the left column, which lists
the node names, and the dynamic membership is the right column, which says
whether the node is currently in cluster membership.



>
>
> > What is implicit
>
>> is that most clusters are of 4 node, hence the logic works for most of the
>> cases.
>>
>
> I disagree with using such an assumption as the basis for a particular
> implementation. Might be that Sun has statistics internally about cluster
> sizes deployed in the field, but as long as the product is supported for
> other sizes as well, it should work well for all of them.


The current implementation will work for all sizes. But yes, there could be
better numbers. Finding such better numbers is way out of scope for this
bug fix; as I said, it is a project by itself.


>
>  This is a heuristic algorithm that I am trying to apply. Hence as any
>> heuristic algorithm, it tries to solve the problem by coming very close to
>> the best possible solution.
>>
>
> I do agree with this approach, and I understand that your change will
> improve a particular scenario. I am only worried that it could have negative
> effects in other scenarios. As I have seen enough of those in the past, so
> I'd be grateful if you could clarify my remaining questions.


My change will make all scenarios work much better. Previously this was
hard-coded to a partition size of n/2 + 1; this changes that. A lot of
changes have gone in over the last 2 years to improve the reconfiguration
time. I do not see any case where this addition will make cluster
reconfiguration slower than the previous implementation. If you have a
particular case in mind, let me know.


>
> Nils
>
>


-- 

Tirthankar
http://insanityrulz.blogspot.com/