Re: [devel] [PATCH 3 of 3] clm: clmna to be respawned by nid when node join request fails [#816]

Hans Feldt Thu, 03 Apr 2014 22:55:23 -0700

Fine, let's treat that as a possible future change. Remember that
removing nid has been discussed; if that happens, we will have no retry
at all without my proposed change.
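
To make the proposal concrete, here is a rough sketch of the kind of
retry loop meant above: retry forever inside clmna, with a configurable
interval that defaults to 30 seconds. All function names here are
hypothetical and nothing below exists in clmna today; the real join
request would take the place of the stand-in callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* All names below are hypothetical; none of them exist in clmna. */

#define DEFAULT_RETRY_INTERVAL_SEC 30U
#define MAX_RETRY_INTERVAL_SEC 3600UL

typedef bool (*join_fn)(void *ctx);            /* true when the join is accepted */
typedef void (*wait_fn)(unsigned int seconds); /* pause between two attempts */

/* Read the retry interval from an environment variable.
 * Unset, empty, non-numeric, zero or oversized values give the default. */
unsigned int retry_interval_from_env(const char *var_name)
{
	const char *val = getenv(var_name);
	if (val == NULL || *val == '\0')
		return DEFAULT_RETRY_INTERVAL_SEC;

	char *end = NULL;
	unsigned long parsed = strtoul(val, &end, 10);
	if (*end != '\0' || parsed == 0 || parsed > MAX_RETRY_INTERVAL_SEC)
		return DEFAULT_RETRY_INTERVAL_SEC;
	return (unsigned int)parsed;
}

/* Call try_join until it succeeds, waiting interval_sec between attempts.
 * Returns the number of attempts that were needed (at least 1). */
unsigned long join_with_retry(join_fn try_join, void *ctx,
			      unsigned int interval_sec, wait_fn wait)
{
	assert(try_join != NULL && wait != NULL);

	unsigned long attempts = 1;
	while (!try_join(ctx)) {
		fprintf(stderr, "join attempt %lu failed, retrying in %u s\n",
			attempts, interval_sec);
		wait(interval_sec);
		attempts++;
	}
	return attempts;
}

/* Default wait function: a plain sleep(). */
void wait_seconds(unsigned int seconds)
{
	sleep(seconds);
}

/* Stand-ins for trying the loop out without a cluster: the join fails
 * as many times as the counter in ctx says, and the wait is skipped. */
bool fail_n_times(void *ctx)
{
	unsigned int *remaining = ctx;
	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}

void skip_wait(unsigned int seconds)
{
	(void)seconds;
}
```

In clmna the call would look something like
join_with_retry(send_join_request, cb,
retry_interval_from_env("CLMNA_JOIN_RETRY_SEC"), wait_seconds), where
both send_join_request and the variable name are made up for this
example. nid would then only supervise the process and handle the
bigger loop.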


About the log "NO proc_initialize_msg: send failed", do you think it
occurs because the payload startup is not correct? clmna seems to have
returned to nid even though the node was not a member, and then amfnd
tried to use CLM. Should it be OK to use the CLM API on a non-member
node?

Thanks,
Hans

On 3 April 2014 14:12, Mathivanan Naickan Palanivelu
<[email protected]> wrote:
> There is no functional need for a retry when it is not a communication 
> problem there.
> The default number of re-spawns has been 3 all along. Yes, we can increase it 
> to a bigger value.
> It is up to the system integrator to decide what to do when OpenSAF startup 
> fails.
>
> Thanks,
> Mathi.
>
> ----- [email protected] wrote:
>
>> On the controller I get an extra "send failed" log message that is
>> ugly:
>>
>> Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName:
>> 'PL-6' is not a configured cluster node.
>> Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO
>> /etc/opensaf/node_name should contain the rdn value of configured
>> CLM node object name
>> Apr  3 13:40:02 SC-1 local0.notice osafclmd[417]: NO
>> proc_initialize_msg: send failed. dest:2060f19cd6007
>> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName:
>> 'PL-6' is not a configured cluster node.
>> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO
>> /etc/opensaf/node_name should contain the rdn value of configured
>> CLM node object name
>> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO
>> proc_initialize_msg: send failed. dest:2060f19cda006
>> Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName:
>> 'PL-6' is not a configured cluster node.
>> Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO
>> /etc/opensaf/node_name should contain the rdn value of configured
>> CLM node object name
>> Apr  3 13:40:33 SC-1 local0.notice osafclmd[417]: NO
>> proc_initialize_msg: send failed. dest:2060f19cda009
>> Apr  3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard
>> node received for nodeId:2060f pid:359
>> Apr  3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link
>> <1.1.1:eth0-1.1.6:eth0>, peer not responding
>>
>> I would like to have a retry loop inside clmna instead. Then we
>> could have nid supervise and, every now and then, do a node reboot.
>> There is not much point in trying three times and then giving up. I
>> think clmna should retry forever and let nodeinit handle the bigger
>> loop. The retry interval needs to be configurable but should have a
>> good default, such as every 30 seconds.
>>
>> Thanks,
>> Hans
>>
>> On 03/27/2014 02:13 AM, [email protected] wrote:
>> >   osaf/services/saf/clmsv/nodeagent/main.c |  8 ++++++++
>> >   1 files changed, 8 insertions(+), 0 deletions(-)
>> >
>> >
>> > When a node join request is attempted for an unconfigured/misconfigured
>> > node, or with a duplicate node_name, clmna should report the error to
>> > NID so that NID attempts to respawn clmna.
>> > With the introduction of this change, the following happens (can be
>> > seen in the syslog) in the case of an unconfigured/misconfigured node
>> > join request:
>> > At the ACTIVE controller syslog:
>> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName:
>> 'PL-8' is not a configured cluster node.
>> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO
>> /etc/opensaf/node_name should contain the rdn value of a configured
>> CLM node object name
>> >
>> > At the unconfigured/misconfigured node, the syslog will look like this:
>> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
>> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a
>> configured node
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed
>> DESC:CLMNA
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for
>> recovery
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN
>> /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to
>> CLMNA, pid=868
>> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
>> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for
>> shutdown
>> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
>> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a
>> configured node
>> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN
>> CLMNA
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed
>> DESC:CLMNA
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN
>> /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to
>> CLMNA, pid=892
>> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for
>> shutdown
>> > Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
>> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a
>> configured node
>> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
>> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN
>> CLMNA
>> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed
>> DESC:CLMNA
>> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
>> > Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for
>> shutdown
>> > Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for
>> shutdown
>> >
>> >
>> > For cases when a duplicate node join request comes, the following
>> syslog
>> > message will be seen at the ACTIVE controller:
>> > Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node
>> join request for CLM node: 'SC-2'. Specify a unique node name
>> in/etc/opensaf/node_name
>> > Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node
>> join request for CLM node: 'SC-2'. Specify a unique node name
>> in/etc/opensaf/node_name
>> >
>> > And the following will be seen at the node on which the duplicate
>> request is attempted:
>> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already
>> up. Specify a unique name in/etc/opensaf/node_name
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed
>> DESC:CLMNA
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for
>> recovery
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN
>> /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL
>> to CLMNA, pid=1456
>> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
>> > Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for
>> shutdown
>> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
>> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already
>> up. Specify a unique name in/etc/opensaf/node_name
>> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN
>> CLMNA
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed
>> DESC:CLMNA
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN
>> /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL
>> to CLMNA, pid=1480
>> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for
>> shutdown
>> >
>> > diff --git a/osaf/services/saf/clmsv/nodeagent/main.c
>> b/osaf/services/saf/clmsv/nodeagent/main.c
>> > --- a/osaf/services/saf/clmsv/nodeagent/main.c
>> > +++ b/osaf/services/saf/clmsv/nodeagent/main.c
>> > @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
>> >                     LOG_ER("%s is not a configured node",
>> >                             o_msg->info.api_resp_info.param.node_name.value);
>> >                     free(o_msg);
>> > +                   rc = error; /* For now, just pass on the error to nid.
>> > +                                * This is not needed in future when node local
>> > +                                * cluster management policy based decisions can be made.
>> > +                                */
>> >                     goto done;
>> >             } else if (error == SA_AIS_ERR_EXIST) {
>> >                     LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
>> >                             o_msg->info.api_resp_info.param.node_name.value);
>> >                     free(o_msg);
>> > +                   rc = error; /* This is not needed in future when node local
>> > +                                * cluster management policy based decisions can be made.
>> > +                                * For now, just pass on the error to nid.
>> > +                                */
>> >                     goto done;
>> >             }
>> >
>> >
>> >
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Opensaf-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel

