On Wed, Aug 17, 2011 at 01:19:53PM +1200, Tim Beale wrote:
> Hi,
>
> I'm resending this patch in a separate thread because I think this part of the
> cluster formation problems I'm seeing has been overlooked. The patch attached
> is one way of addressing the problem, but I'm open to alternatives.
>
> Basically the problem is that if the cluster experiences formation problems,
> then CPG can sometimes choose a downlist that includes the local node. When
> it processes the node leave event for itself it sets its cpd state to
> CPD_STATE_UNJOINED and clears the cpd->group_name. This means CPG events are
> no longer sent to the CPG client, because the cpd->group_name no longer matches.
>
> This patch avoids the problem by only clearing the group_name if cpg_leave()
> is called and not when processing a downlist leave event. I'm not 100% sure about
> the case where the CPG client exits unexpectedly (in which case the reason is
> also CONFCHG_CPG_REASON_PROCDOWN), but I figure the cpd info gets cleaned up
> immediately on the local node if this happens.
>
Tim, this seems reasonable to me. But it would be good to get Honza to
review this as he wrote it.
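
For anyone skimming the thread, the change boils down to something like the
standalone sketch below. The sketch_* names, the enum values and the struct
layouts are made up for illustration and are not the real corosync
definitions; only the three-part condition mirrors the patch.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative stand-ins only; NOT the real corosync types or values. */
enum sketch_reason {
	SKETCH_REASON_LEAVE,     /* client called cpg_leave() */
	SKETCH_REASON_NODEDOWN,  /* node appeared in a downlist */
	SKETCH_REASON_PROCDOWN   /* client process went away */
};

struct sketch_left_entry {
	unsigned int nodeid;
	unsigned int pid;
	enum sketch_reason reason;
};

struct sketch_cpd {
	unsigned int pid;
	char group_name[32];
};

/*
 * True only when the left-list entry is this client on the local node
 * AND the departure was an explicit leave (the check the patch adds).
 */
static bool sketch_should_clear(const struct sketch_cpd *cpd,
	const struct sketch_left_entry *left, unsigned int local_nodeid)
{
	return left->pid == cpd->pid &&
		left->nodeid == local_nodeid &&
		left->reason == SKETCH_REASON_LEAVE;
}

/* Apply one left-list entry to the client state; returns true if cleared. */
static bool sketch_apply_leave(struct sketch_cpd *cpd,
	const struct sketch_left_entry *left, unsigned int local_nodeid)
{
	if (!sketch_should_clear(cpd, left, local_nodeid)) {
		return false;
	}
	cpd->pid = 0;
	memset(cpd->group_name, 0, sizeof(cpd->group_name));
	return true;
}
```

With this shape, a downlist entry for the local node leaves pid and
group_name untouched, so later events still match the client; only an
explicit leave wipes them.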
-Angus
> Regards,
> Tim
>
> ---
>
> services/cpg.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/services/cpg.c b/services/cpg.c
> index 8e71dcf..c66037b 100644
> --- a/services/cpg.c
> +++ b/services/cpg.c
> @@ -683,7 +683,8 @@ static int notify_lib_joinlist(
>  		}
>  		if (left_list_entries) {
>  			if (left_list[0].pid == cpd->pid &&
> -				left_list[0].nodeid == api->totem_nodeid_get()) {
> +				left_list[0].nodeid == api->totem_nodeid_get() &&
> +				left_list[0].reason == CONFCHG_CPG_REASON_LEAVE) {
>  
>  				cpd->pid = 0;
>  				memset (&cpd->group_name, 0, sizeof(cpd->group_name));
> From: Tim Beale <[email protected]>
>
> A CPG client can sometimes lock up if the local node is in the downlist
>
> In a 10-node cluster where all nodes are booting up and starting corosync
> at the same time, sometimes during this process corosync detects a node as
> leaving and rejoining the cluster.
>
> Occasionally the downlist that gets picked contains the local node. When the
> local node sends leave events for the downlist (including itself), it sets
> its cpd state to CPD_STATE_UNJOINED and clears the cpd->group_name. This
> means it no longer sends CPG events to the CPG client.
>
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais