Re: [devel] [PATCH 3 of 3] clm: clmna to be respawned by nid when node join request fails [#816]

Mathivanan Naickan Palanivelu Thu, 03 Apr 2014 05:12:55 -0700

There is no functional need for a retry when the failure is not a
communication problem.
The default number of respawns has been 3 all along. Yes, we can increase it
to a larger value.
It is up to the system integrator to decide what to do when OpenSAF startup
fails.
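
For reference, the retry loop proposed in the quoted mail below could be
sketched roughly as follows. This is only an illustration under stated
assumptions, not code from the patch: join_result_t, join_fn_t,
join_with_retry and join_after_n are hypothetical names, and the real clmna
works with SaAisErrorT values rather than this enum. A max_attempts of 0
means "retry forever"; a non-zero value bounds the number of attempts, as the
nid respawn count does today.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical result codes; the real clmna uses SaAisErrorT values. */
typedef enum {
	JOIN_OK = 0,
	JOIN_NOT_CONFIGURED,
	JOIN_DUPLICATE
} join_result_t;

/* Hypothetical callback that performs exactly one node join attempt. */
typedef join_result_t (*join_fn_t)(void *ctx);

/*
 * Call try_join() until it succeeds, waiting interval_sec seconds between
 * attempts. max_attempts == 0 means retry forever; otherwise give up after
 * max_attempts failed attempts. Returns the result of the last attempt.
 */
static join_result_t join_with_retry(join_fn_t try_join, void *ctx,
				     unsigned interval_sec,
				     unsigned max_attempts)
{
	unsigned attempt = 0;

	for (;;) {
		join_result_t res = try_join(ctx);

		attempt++;
		if (res == JOIN_OK)
			return res;
		if (max_attempts != 0 && attempt >= max_attempts)
			return res;

		fprintf(stderr, "join attempt %u failed (%d), retrying in %u s\n",
			attempt, (int)res, interval_sec);
		sleep(interval_sec);
	}
}

/*
 * Demo stub (assumption, for illustration only): ctx points to the number
 * of failures still to be simulated before the join succeeds.
 */
static join_result_t join_after_n(void *ctx)
{
	unsigned *failures_left = ctx;

	if (*failures_left == 0)
		return JOIN_OK;
	(*failures_left)--;
	return JOIN_NOT_CONFIGURED;
}
```

With such a helper, join_with_retry(join_after_n, &n, 30, 0) would model the
"retry forever every 30 sec" proposal, while a max_attempts of 3 would model
the current give-up-after-three behaviour.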


Thanks,
Mathi.

----- [email protected] wrote:

> On the controller I get an extra "send failed" log message that is
> ugly:
> 
> Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> Apr  3 13:40:02 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cd6007
> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda006
> Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> Apr  3 13:40:33 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda009
> Apr  3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard node received for nodeId:2060f pid:359
> Apr  3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link <1.1.1:eth0-1.1.6:eth0>, peer not responding
> 
> I would like to have a retry loop inside clmna instead. Then nid could
> supervise clmna and reboot the node every now and then. There is not
> much point in trying three times and then giving up. I think clmna
> should retry forever and let nodeinit handle the bigger loop. The retry
> interval should be configurable, with a good default such as every
> 30 sec.
> 
> Thanks,
> Hans
> 
> On 03/27/2014 02:13 AM, [email protected] wrote:
> >   osaf/services/saf/clmsv/nodeagent/main.c |  8 ++++++++
> >   1 files changed, 8 insertions(+), 0 deletions(-)
> >
> >
> > When a node join request is attempted for an unconfigured/misconfigured
> > node, or with a duplicate node_name, clmna should report those errors
> > to NID so that NID attempts to respawn clmna.
> > With this change, the following happens (as can be seen in the syslog)
> > in the case of an unconfigured/misconfigured node join request.
> > At the ACTIVE controller syslog:
> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName: 'PL-8' is not a configured cluster node.
> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO /etc/opensaf/node_name should contain the rdn value of a configured CLM node object name
> >
> > At the unconfigured/misconfigured node, the syslog looks like this:
> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a configured node
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed  DESC:CLMNA
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for recovery
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=868
> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for shutdown
> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a configured node
> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed  DESC:CLMNA
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=892
> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for shutdown
> > Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a configured node
> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed  DESC:CLMNA
> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
> > Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for shutdown
> > Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for shutdown
> >
> >
> > For cases when a duplicate node join request comes, the following
> > syslog message will be seen at the ACTIVE controller:
> > Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> > Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> >
> > And the following will be seen at the node on which the duplicate
> > request is attempted:
> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed  DESC:CLMNA
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for recovery
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1456
> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
> > Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for shutdown
> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN CLMNA
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed  DESC:CLMNA
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1480
> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for shutdown
> >
> > diff --git a/osaf/services/saf/clmsv/nodeagent/main.c b/osaf/services/saf/clmsv/nodeagent/main.c
> > --- a/osaf/services/saf/clmsv/nodeagent/main.c
> > +++ b/osaf/services/saf/clmsv/nodeagent/main.c
> > @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
> >                     LOG_ER("%s is not a configured node",
> >                             o_msg->info.api_resp_info.param.node_name.value);
> >                     free(o_msg);
> > +                   rc = error; /* For now, just pass on the error to nid.
> > +                                * This is not needed in future when node local
> > +                                * cluster management policy based decisions can be made.
> > +                                */
> >                     goto done;
> >             } else if (error == SA_AIS_ERR_EXIST) {
> >                     LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
> >                             o_msg->info.api_resp_info.param.node_name.value);
> >                     free(o_msg);
> > +                   rc = error; /* This is not needed in future when node local
> > +                                * cluster management policy based decisions can be made.
> > +                                * For now, just pass on the error to nid.
> > +                                */
> >                     goto done;
> >             }
> >
> >
> >

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
