Hi,

    I am just moving over to gfs, so I have been looking at some of the
openais code related to gfs.

    There are two entry points into the COMMIT state:

message_handler_memb_commit_token   --> COMMIT ;
memb_join_process
{...
                if (memb_consensus_agreed (instance) &&
                        memb_lowest_in_config (instance)) {

                        memb_state_commit_token_create (instance, my_commit_token);

                        memb_state_commit_enter (instance, my_commit_token);
                }
...
}

Unless memb_consensus_agreed returns true, no node can get into the COMMIT
state. So which configuration did the nodes agree on?
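
To make the question concrete, here is a minimal sketch of how I read the consensus check: a node agrees once every processor in its proc list that is not in its failed list has sent a join message with matching lists. This is a simplified model under that assumption, not the openais source; struct node_view, MAX_NODES and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16 /* hypothetical limit for this sketch */

/* Hypothetical, simplified view of one node's membership state. */
struct node_view {
    unsigned int proc_list[MAX_NODES];   /* processors this node knows about */
    size_t proc_entries;
    unsigned int failed_list[MAX_NODES]; /* processors it considers failed */
    size_t failed_entries;
    unsigned int consensus[MAX_NODES];   /* processors whose join matched ours */
    size_t consensus_entries;
};

/* True if id appears in the first n entries of list. */
bool list_contains(const unsigned int *list, size_t n, unsigned int id)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i] == id) {
            return true;
        }
    }
    return false;
}

/* True when every non-failed processor has agreed with this node's view. */
bool consensus_agreed(const struct node_view *v)
{
    for (size_t i = 0; i < v->proc_entries; i++) {
        unsigned int id = v->proc_list[i];
        if (list_contains(v->failed_list, v->failed_entries, id)) {
            continue; /* failed processors are not required to agree */
        }
        if (!list_contains(v->consensus, v->consensus_entries, id)) {
            return false; /* still waiting for a matching join from id */
        }
    }
    return true;
}
```

In this model a single node whose join carries a different proc list or failed list is enough to keep consensus_agreed false on everyone who still lists it as alive.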

>>May 19 17:14:53 node02 openais[11281]: [TOTEM] entering COMMIT state.
>>May 19 17:14:58 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state

The token is lost at every token timeout (5s). Can you find out where on the
ring the token stops? Which nodes are members of the ring?
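
As a quick check of the 5s figure, the gap between "entering COMMIT state" (17:14:53) and "The token was lost" (17:14:58) in the node02 log can be compared against token: 5000. A small sketch, assuming both timestamps fall on the same day; seconds_of_day and interval_ms are hypothetical helper names:

```c
#include <stdio.h>

/* Parse "HH:MM:SS" into seconds since midnight; returns -1 on bad input. */
int seconds_of_day(const char *hms)
{
    int h, m, s;
    if (sscanf(hms, "%d:%d:%d", &h, &m, &s) != 3) {
        return -1;
    }
    return h * 3600 + m * 60 + s;
}

/*
 * Interval in milliseconds between two same-day log timestamps.
 * Both strings are assumed to be well-formed "HH:MM:SS".
 */
long interval_ms(const char *from, const char *to)
{
    return 1000L * (seconds_of_day(to) - seconds_of_day(from));
}
```

interval_ms("17:14:53", "17:14:58") gives 5000, which matches the configured token timeout, while the two consensus timeout messages on node10 (17:15:03 and 17:15:14) are 11000 ms apart.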

>>May 19 17:15:03 node10 openais[20326]: [TOTEM] The consensus timeout expired

Is node10 trying to reach consensus on a different membership, one that
differs from the membership of the commit token sender? If so, what are the
two memberships?
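
What I would compare is the member list node10 is trying to agree on against the list carried by the commit token. A sketch of that comparison, treating both lists as sets of node ids; ids_subset and ids_same_set are hypothetical names, and this is not the openais set code:

```c
#include <stdbool.h>
#include <stddef.h>

/* True if every id in a[] also appears in b[]. */
bool ids_subset(const unsigned int *a, size_t na,
                const unsigned int *b, size_t nb)
{
    for (size_t i = 0; i < na; i++) {
        bool found = false;
        for (size_t j = 0; j < nb; j++) {
            if (a[i] == b[j]) {
                found = true;
                break;
            }
        }
        if (!found) {
            return false;
        }
    }
    return true;
}

/* True if both lists name the same members, regardless of order. */
bool ids_same_set(const unsigned int *a, size_t na,
                  const unsigned int *b, size_t nb)
{
    return ids_subset(a, na, b, nb) && ids_subset(b, nb, a, na);
}
```

If the two lists differ by even one node, the two groups are working toward different rings, which would fit one group looping in COMMIT while the other never leaves GATHER.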

Do you have any more information on the questions above?

2010/5/20 Dejan Muhamedagic <[email protected]>

> Hi,
>
> We have a strange case of openais (whitetank) failing in a
> 12-node cluster. After stopping openais on node 11, the other
> nodes recover and form an 11-node cluster without any problems.
> However, after stopping openais on node 7, the cluster falls
> apart into ten 1-node partitions.
>
> No node enters the RECOVERY state. There are numerous messages
> about tokens lost in the COMMIT state. All nodes just loop
> without making any progress.
>
> It is interesting that the nodes behave in two different ways:
> the first group consisting of nodes 1-6 and the second of nodes
> 8-10,12. Nodes in the first group send tokens to the multicast
> address once every token timeout (5s), then report that the
> token was lost, then reenter the COMMIT state:
>
> May 19 17:14:46 node02 openais[11281]: [TOTEM] entering GATHER state from 12
> May 19 17:14:52 node02 openais[11281]: [TOTEM] entering GATHER state from 11
> May 19 17:14:53 node02 openais[11281]: [TOTEM] Saving state aru bf high seq received bf
> May 19 17:14:53 node02 openais[11281]: [TOTEM] Storing new sequence id for ring 374e00
> May 19 17:14:53 node02 openais[11281]: [TOTEM] entering COMMIT state.
> May 19 17:14:58 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state.
> May 19 17:14:58 node02 openais[11281]: [TOTEM] entering GATHER state from 4.
> May 19 17:14:58 node02 openais[11281]: [TOTEM] Storing new sequence id for ring 374e04
> May 19 17:14:58 node02 openais[11281]: [TOTEM] entering COMMIT state.
> May 19 17:15:03 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state.
> May 19 17:15:03 node02 openais[11281]: [TOTEM] entering GATHER state from 4.
> ...
>
> Nodes of the second group send tokens more often (once a second),
> enter only the GATHER state, then complain about the consensus
> timeout expired once every 11s.
>
> May 19 17:14:46 node10 openais[20326]: [TOTEM] entering GATHER state from 12
> May 19 17:14:52 node10 openais[20326]: [TOTEM] entering GATHER state from 11
> May 19 17:15:03 node10 openais[20326]: [TOTEM] The consensus timeout expired
> May 19 17:15:03 node10 openais[20326]: [TOTEM] entering GATHER state from 3.
> May 19 17:15:14 node10 openais[20326]: [TOTEM] The consensus timeout expired
> ...
>
> This is on an IBM BladeCenter with 8-way fast nodes. There are no
> errors on the network interface and looking at the tcpdump it
> looks like all packets are delivered.
>
> The totem configuration:
>
> totem {
>        version: 2
>        token:          5000
>        token_retransmits_before_loss_const: 10
>        join:           1000
>        consensus:      6000
>        vsftype:        none
>        max_messages:   20
>        send_join: 45
>        clear_node_high_bit: yes
>        secauth:        off
>        threads:        8
>        interface {
>                ringnumber: 0
>                bindnetaddr: 192.168.58.0
>                mcastaddr: 226.94.1.1
>                mcastport: 5405
>        }
> }
>
> We tried fiddling a bit with token, join, and consensus
> parameters, but nothing helped.
>
> If anybody can shed some light on the matter I'd really
> appreciate it. Please let me know if you need more information.
>
> Cheers,
>
> Dejan
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
>