Re: [devel] [PATCH 3 of 3] clm: clmna to be respawned by nid when node join request fails [#816]

Hans Feldt Thu, 03 Apr 2014 04:49:28 -0700

On the controller I get an extra "send failed" log message that is ugly:


Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
Apr  3 13:40:02 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cd6007
Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda006
Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
Apr  3 13:40:33 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda009
Apr  3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard node received for nodeId:2060f pid:359
Apr  3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link <1.1.1:eth0-1.1.6:eth0>, peer not responding

I would rather have a retry loop inside clmna. nid could then supervise
clmna and reboot the node every now and then. There is not much point in
trying three times and then giving up. I think clmna should retry forever
and let nodeinit handle the bigger loop. The retry interval needs to be
configurable, but it should have a good default, such as every 30 seconds.
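
A rough sketch of the kind of loop I mean, not a patch against main.c.
All names here are made up for illustration: retry_interval_sec(),
join_with_retry(), the fake_join()/no_sleep() stand-ins, and the idea of
reading the interval from a variable such as CLMNA_JOIN_RETRY_INTERVAL
are assumptions, not existing clmna code. The join attempt is passed in
as a callback so the loop itself can be tried out without MDS.

```c
#include <stdlib.h>
#include <unistd.h>

#define DEFAULT_RETRY_INTERVAL_SEC 30

/* Callback that performs one join attempt; returns 0 on success. */
typedef int (*join_fn)(void *ctx);

/* Parse the configured interval (e.g. the value of a hypothetical
 * CLMNA_JOIN_RETRY_INTERVAL variable). Fall back to the default when
 * the value is missing, not a number, or out of a sane range. */
static unsigned retry_interval_sec(const char *value)
{
	char *end;
	long v;

	if (value == NULL || *value == '\0')
		return DEFAULT_RETRY_INTERVAL_SEC;
	v = strtol(value, &end, 10);
	if (*end != '\0' || v <= 0 || v > 3600)
		return DEFAULT_RETRY_INTERVAL_SEC;
	return (unsigned)v;
}

/* Retry the join until it succeeds, sleeping between attempts.
 * There is no upper bound on attempts: the supervisor (nid) owns the
 * bigger loop. Returns the number of attempts that were needed. */
static unsigned join_with_retry(join_fn try_join, void *ctx,
				unsigned interval_sec,
				unsigned (*sleep_fn)(unsigned))
{
	unsigned attempts = 1;

	while (try_join(ctx) != 0) {
		sleep_fn(interval_sec);
		attempts++;
	}
	return attempts;
}

/* Stand-in for the real join request: fails *ctx times, then succeeds. */
static int fake_join(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -1;
	}
	return 0;
}

/* Stand-in for sleep() so the loop can be exercised without waiting. */
static unsigned no_sleep(unsigned sec)
{
	(void)sec;
	return 0;
}
```

In the daemon one would pass sleep() from unistd.h as sleep_fn and
something like retry_interval_sec(getenv("CLMNA_JOIN_RETRY_INTERVAL"))
as the interval. Again, that variable name is only a placeholder.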

Thanks,
Hans

On 03/27/2014 02:13 AM, [email protected] wrote:
>   osaf/services/saf/clmsv/nodeagent/main.c |  8 ++++++++
>   1 files changed, 8 insertions(+), 0 deletions(-)
>
>
> When a node join request is attempted for an unconfigured/misconfigured
> node, or with a duplicate node_name, clmna should report those errors to
> NID so that NID attempts to respawn clmna.
> With this change, the following happens (as can be seen in the syslog)
> in the case of an unconfigured/misconfigured node join request:
> At the ACTIVE controller syslog:
> Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName: 'PL-8' is not a configured cluster node.
> Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO /etc/opensaf/node_name should contain the rdn value of a configured CLM node object name
>
> At the unconfigured/misconfigured node, the syslog looks like this:
> Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
> Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a configured node
> Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed   DESC:CLMNA
> Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for recovery
> Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=868
> Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
> Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for shutdown
> Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
> Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a configured node
> Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
> Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed   DESC:CLMNA
> Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=892
> Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for shutdown
> Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
> Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a configured node
> Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
> Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed   DESC:CLMNA
> Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
> Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for shutdown
> Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for shutdown
>
>
> For cases when a duplicate node join request comes, the following syslog
> message will be seen at the ACTIVE controller:
> Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
>
> And the following will be seen at the node on which the duplicate request is attempted:
> Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed   DESC:CLMNA
> Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for recovery
> Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1456
> Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
> Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for shutdown
> Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
> Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
> Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN CLMNA
> Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed   DESC:CLMNA
> Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1480
> Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for shutdown
>
> diff --git a/osaf/services/saf/clmsv/nodeagent/main.c b/osaf/services/saf/clmsv/nodeagent/main.c
> --- a/osaf/services/saf/clmsv/nodeagent/main.c
> +++ b/osaf/services/saf/clmsv/nodeagent/main.c
> @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
>                       LOG_ER("%s is not a configured node",
>                               o_msg->info.api_resp_info.param.node_name.value);
>                       free(o_msg);
> +                     rc = error; /* For now, just pass on the error to nid.
> +                                  * This is not needed in future when node local
> +                                  * cluster management policy based decisions can be made.
> +                                  */
>                       goto done;
>               } else if (error == SA_AIS_ERR_EXIST) {
>                       LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
>                               o_msg->info.api_resp_info.param.node_name.value);
>                       free(o_msg);
> +                     rc = error; /* This is not needed in future when node local
> +                                  * cluster management policy based decisions can be made.
> +                                  * For now, just pass on the error to nid.
> +                                  */
>                       goto done;
>               }
>
>
>

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
