Re: [Openais] Question about totem message "FAILED TO RECEIVE"

Steven Dake Thu, 19 Jun 2008 01:16:41 -0700

Increase fail_to_recv_const in the configuration.  Right now it's probably
about 30.
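
For example, something along these lines in the totem section of
openais.conf. This is only a sketch: 100 is an illustrative value, not a
recommendation, and the key name and default should be checked against
openais.conf(5) for the version you run.

```
totem {
	# ... existing totem settings stay as they are ...

	# assumed value for illustration; larger means more tolerance
	fail_to_recv_const: 100
}
```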


It represents the number of token rotations in which a message should have
been received via multicast but was not.
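
A rough sketch of that counting, in case it helps. This is a simplified
stand-in and not the actual totemsrp.c code: struct recv_tracker and
tracker_on_token are hypothetical names, and the check on token->aru_addr
from the snippet you quoted is left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-node state (not struct totemsrp_instance). */
struct recv_tracker {
	uint32_t last_aru;       /* aru carried by the token on the previous rotation */
	unsigned int aru_count;  /* consecutive rotations in which the aru did not move */
	unsigned int fail_const; /* threshold, playing the role of fail_to_recv_const */
};

/*
 * Call once per token rotation with the aru carried by the token.
 * Returns true once the aru has been stuck for more than fail_const
 * consecutive rotations, which is the point where the real code would
 * log "FAILED TO RECEIVE" and enter the gather state.
 */
static bool tracker_on_token(struct recv_tracker *t, uint32_t token_aru)
{
	assert(t != NULL);

	if (token_aru == t->last_aru) {
		t->aru_count++;   /* no progress during this rotation */
	} else {
		t->aru_count = 0; /* messages were received, start counting again */
	}
	t->last_aru = token_aru;

	return t->aru_count > t->fail_const;
}
```

The count resets as soon as the aru advances, so only an unbroken run of
stalled rotations trips the threshold. A larger constant therefore buys
more tolerance for a congested network at the cost of slower detection
of a node that really cannot receive.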

It is usually a sign of an overloaded network (as you mentioned), possibly
coupled with a poorly designed switch.

As for the TODO, which version of openais are you using?  I believe I
fixed this in whitetank, and the forward port to trunk should have this
problem cleared up as well.

regards
-steve

On Wed, Jun 18, 2008 at 6:36 PM, Angus & Anna Salkeld <[EMAIL PROTECTED]>
wrote:

> Hi
>
> In a VERY busy network (storm conditions) I am getting the following
> log messages continuously:
>
> 01:19:17 awplus openais[1474]: [TOTEM] FAILED TO RECEIVE
> 01:19:17 awplus openais[1474]: [amf.c:1365] >amf_confchg_fn: mnum: 2,
> jnum: 0, lnum: 0, sync state: NORMAL_OPERATION, ring ID 356 rep
> 192.168.255.1
> 01:19:17 awplus openais[1474]: [amf.c:1365] >amf_confchg_fn: mnum: 2,
> jnum: 0, lnum: 0, sync state: NORMAL_OPERATION, ring ID 356 rep
> 192.168.255.1
> 01:19:17 awplus openais[1474]: [SYNC ] This node is within the primary
> component and will provide service.
> 01:19:17 awplus openais[1474]: [TOTEM] entering OPERATIONAL state.
> 01:19:24 awplus openais[1474]: [amf.c:1365] >amf_confchg_fn: mnum: 2,
> jnum: 0, lnum: 0, sync state: NORMAL_OPERATION, ring ID 360 rep
> 192.168.255.1
> 01:19:24 awplus openais[1474]: [amf.c:1365] >amf_confchg_fn: mnum: 2,
> jnum: 0, lnum: 0, sync state: NORMAL_OPERATION, ring ID 360 rep
> 192.168.255.1
> 01:19:24 awplus openais[1474]: [SYNC ] This node is within the primary
> component and will provide service.
> 01:19:24 awplus openais[1474]: [TOTEM] entering OPERATIONAL state.
>
>
> In exec/totemsrp.c line ~3333 there is the following TODO:
>
>                if (instance->my_aru_count >
>                        instance->totem_config->fail_to_recv_const &&
>                        token->aru_addr != instance->my_id.addr[0].nodeid) {
>
>                        log_printf (instance->totemsrp_log_level_error,
>                                "FAILED TO RECEIVE\n");
> // TODO if we fail to receive, it may be possible to end with a gather
> // state of proc == failed = 0 entries
> /* THIS IS A BIG TODO
>                        memb_set_merge (&token->aru_addr, 1,
>                                instance->my_failed_list,
>                                &instance->my_failed_list_entries);
> */
>
>                        ring_state_restore (instance);
>
>                        memb_state_gather_enter (instance, 6);
>                } else {
>
> My questions are:
> 1] Am I right in the following:
>   - totem thinks that it has lost a node
>   - it sends a member join message
>   - the member joins quite happily
>   - repeat the above sequence
>
> 2] If the above is true what can I do to prevent the state flap?
>    - within this horrible network
>
> 3] What needs to be done in the TODO (the comment is a bit cryptic to me)?
>
> Thanks
> Angus Salkeld
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
