There is no functional need for a retry here, because the failure is not a communication problem. The default number of re-spawns has always been 3; yes, we can increase it to a larger value. Ultimately it is up to the system integrator to decide what to do when OpenSAF startup fails.
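For reference, a minimal sketch of the kind of retry loop you suggest below (retry the join until it succeeds, with a configurable interval defaulting to 30 sec) could look roughly like this. It is only an illustration, not the actual osafclmna code; clmna_send_join_request(), clmna_join_loop(), get_retry_interval() and the CLMNA_JOIN_RETRY_INTERVAL variable are made-up names:

/* Hypothetical sketch only -- not the real osafclmna implementation.
 * Retry the cluster join forever inside clmna with a configurable
 * interval (default 30 seconds) and leave the bigger recovery loop
 * (e.g. node reboot) to nodeinit, which today re-spawns a failed
 * service at most 3 times by default.
 */
#include <stdbool.h>
#include <stdlib.h>
#include <syslog.h>
#include <unistd.h>

#define DEFAULT_JOIN_RETRY_INTERVAL_SEC 30

/* Made-up helper: sends the node join request and returns true once
 * the ACTIVE CLM server has accepted this node. */
extern bool clmna_send_join_request(void);

static unsigned int get_retry_interval(void)
{
        /* Made-up environment variable for the configurable interval. */
        const char *val = getenv("CLMNA_JOIN_RETRY_INTERVAL");
        unsigned int interval = val ? (unsigned int)strtoul(val, NULL, 10) : 0;

        return interval > 0 ? interval : DEFAULT_JOIN_RETRY_INTERVAL_SEC;
}

static void clmna_join_loop(void)
{
        const unsigned int interval = get_retry_interval();

        while (!clmna_send_join_request()) {
                syslog(LOG_NOTICE,
                       "join request not accepted, retrying in %u seconds",
                       interval);
                sleep(interval);
        }
}

Either way, the escalation policy (give up, keep retrying, or reboot the node) stays with nid/nodeinit and the system integrator.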
Thanks,
Mathi.

----- [email protected] wrote:
> On the controller I get an extra "send failed" log message that is ugly:
>
> Apr 3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> Apr 3 13:40:01 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> Apr 3 13:40:02 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cd6007
> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda006
> Apr 3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> Apr 3 13:40:32 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> Apr 3 13:40:33 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda009
> Apr 3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard node received for nodeId:2060f pid:359
> Apr 3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link <1.1.1:eth0-1.1.6:eth0>, peer not responding
>
> I would like to have a try again loop inside clmna instead. Then we could have the nid to supervise and every now and
> then do a node reboot. There is not much point in trying three times and give up. I think clmna should retry forever and
> let nodeinit handle the bigger loop. The try interval needs to be configured but should have good default such as every
> 30 sec.
>
> Thanks,
> Hans
>
> On 03/27/2014 02:13 AM, [email protected] wrote:
> >  osaf/services/saf/clmsv/nodeagent/main.c |  8 ++++++++
> >  1 files changed, 8 insertions(+), 0 deletions(-)
> >
> >
> > When a node join request for an unconfigured/misconfigured node or
> > when a node join request with a duplicate node_name is attempted, then
> > clmna should report those errors to NID such that NID attempts to
> > respawn clmna.
> > With the introduction of this change, the following happens (can be seen in the syslog) in
> > the case of an unconfigured/misconfigured node join request:
> > At the ACTIVE controller syslog:
> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName: 'PL-8' is not a configured cluster node.
> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO /etc/opensaf/node_name should contain the rdn value of a configured CLM node object name
> >
> > At the unconfigured/misconfigured node, the syslog will be as below:
> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a configured node
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for recovery
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=868
> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for shutdown
> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a configured node
> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=892
> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for shutdown
> > Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a configured node
> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
> > Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for shutdown
> > Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for shutdown
> >
> >
> > For cases when a duplicate node join request comes, the following syslog
> > messages will be seen at the ACTIVE controller:
> > Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> > Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> >
> > And the following will be seen at the node on which the duplicate request is attempted:
> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for recovery
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1456
> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
> > Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for shutdown
> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN CLMNA
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1480
> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for shutdown
> >
> > diff --git a/osaf/services/saf/clmsv/nodeagent/main.c b/osaf/services/saf/clmsv/nodeagent/main.c
> > --- a/osaf/services/saf/clmsv/nodeagent/main.c
> > +++ b/osaf/services/saf/clmsv/nodeagent/main.c
> > @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
> >  		LOG_ER("%s is not a configured node",
> >  			o_msg->info.api_resp_info.param.node_name.value);
> >  		free(o_msg);
> > +		rc = error; /* For now, just pass on the error to nid.
> > +			     * This is not needed in future when node local
> > +			     * cluster management policy based decisions can be made.
> > +			     */
> >  		goto done;
> >  	} else if (error == SA_AIS_ERR_EXIST) {
> >  		LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
> >  			o_msg->info.api_resp_info.param.node_name.value);
> >  		free(o_msg);
> > +		rc = error; /* This is not needed in future when node local
> > +			     * cluster management policy based decisions can be made.
> > +			     * For now, just pass on the error to nid.
> > +			     */
> >  		goto done;
> >  	}

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
