On Wed, Aug 17, 2011 at 01:19:53PM +1200, Tim Beale wrote:
> Hi,
> 
> I'm resending this patch in a separate thread because I think this part of the
> cluster formation problems I'm seeing has been overlooked. The patch attached
> is one way of addressing the problem, but I'm open to alternatives.
> 
> Basically the problem is that if the cluster experiences formation problems,
> then CPG can sometimes choose a downlist that includes the local node. When
> the local node processes the leave event for itself, it sets its cpd state to
> CPD_STATE_UNJOINED and clears cpd->group_name. CPG events are then no
> longer sent to the CPG client, because cpd->group_name no longer matches.
> 
> This patch avoids the problem by only clearing the group_name when
> cpg_leave() is called, and not when processing a downlist leave event. I'm
> not 100% sure about the case where the CPG client exits unexpectedly (in
> which case the reason is CONFCHG_CPG_REASON_PROCDOWN), but I figure the cpd
> info gets cleaned up immediately on the local node if that happens.
> 

Tim, this seems reasonable to me. But it would be good to get Honza to
review this as he wrote it.

-Angus

> Regards,
> Tim
> 
> ---
> 
>  services/cpg.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/services/cpg.c b/services/cpg.c
> index 8e71dcf..c66037b 100644
> --- a/services/cpg.c
> +++ b/services/cpg.c
> @@ -683,7 +683,8 @@ static int notify_lib_joinlist(
>  				}
>  				if (left_list_entries) {
>  					if (left_list[0].pid == cpd->pid &&
> -						left_list[0].nodeid == api->totem_nodeid_get()) {
> +						left_list[0].nodeid == api->totem_nodeid_get() &&
> +						left_list[0].reason == CONFCHG_CPG_REASON_LEAVE) {
>  
>  						cpd->pid = 0;
>  						memset (&cpd->group_name, 0, sizeof(cpd->group_name));

> From: Tim Beale <[email protected]>
> 
> A CPG client can sometimes lock up if the local node is in the downlist
> 
> In a 10-node cluster where all nodes are booting up and starting corosync
> at the same time, sometimes during this process corosync detects a node as
> leaving and rejoining the cluster.
> 
> Occasionally the downlist that gets picked contains the local node. When the
> local node sends leave events for the downlist (including itself), it sets
> its cpd state to CPD_STATE_UNJOINED and clears the cpd->group_name. This
> means it no longer sends CPG events to the CPG client.
> 
> ---
> 
>  services/cpg.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/services/cpg.c b/services/cpg.c
> index 8e71dcf..c66037b 100644
> --- a/services/cpg.c
> +++ b/services/cpg.c
> @@ -683,7 +683,8 @@ static int notify_lib_joinlist(
>  				}
>  				if (left_list_entries) {
>  					if (left_list[0].pid == cpd->pid &&
> -						left_list[0].nodeid == api->totem_nodeid_get()) {
> +						left_list[0].nodeid == api->totem_nodeid_get() &&
> +						left_list[0].reason == CONFCHG_CPG_REASON_LEAVE) {
>  
>  						cpd->pid = 0;
>  						memset (&cpd->group_name, 0, sizeof(cpd->group_name));

> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
