The analysis is:
The CSI set callback for quiesced arrives at smfd at 18:53:31. smfd does not send a
response to AMF from amf_quiesced_state_handler because it can only send it after it
receives the quiesced ack from MDS.
<141>1 2018-09-14T18:53:31.67077+05:30 osafsmfd [meta sequenceId="1309"]
6065:smf/smfd/smfd_amf.cc:153 NO MDS amf_quiesced_state_handler:
smfd_mds_change_role()
Check the code in amf_csi_set_callback():
~~~
	if (new_haState == SA_AMF_HA_QUIESCED)
		/* AMF response will be done later when MDS quiesced ack has
		   been received */
		goto done;
~~~
So smfd has not sent the response to AMF yet; it is waiting for the quiesced ack
from MDS.
But the SMFND down and up events reach smfd before the MDS quiesced ack does.
While handling SMFND UP, smfd gets stuck in the saClmInitialize() API call and
never returns from clma_startup(), because there is no trace of leaving
clma_startup(). This is the root cause of the AMF CSI set callback timeout: smfd
never sends the response to AMF because it is stuck in saClmInitialize()
(a sketch of this deferred-response pattern follows the trace excerpt below).
<143>1 2018-09-14T18:53:31.906809+05:30 osafsmfd [meta sequenceId="1319"]
6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2020f, svc version
2 role 1
<143>1 2018-09-14T18:53:31.906875+05:30 osafsmfd [meta sequenceId="1320"]
6065:smf/smfd/smfd_smfnd.c:210 TR SMFND DOWN for node id 2020f
<143>1 2018-09-14T18:53:31.906883+05:30 osafsmfd [meta sequenceId="1321"]
6065:smf/smfd/smfd_smfnd.c:228 TR SMFND state updated for node
[safNode=SC-2,safCluster=myClmCluster], id [2020f]
<143>1 2018-09-14T18:53:31.954053+05:30 osafsmfd [meta sequenceId="1322"]
6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 3 for nodeid 2020f, svc version
2 role 1
<143>1 2018-09-14T18:53:31.954085+05:30 osafsmfd [meta sequenceId="1323"]
6065:smf/smfd/smfd_smfnd.c:124 TR SMFND UP for node id 2020f, version 2
<143>1 2018-09-14T18:53:31.954092+05:30 osafsmfd [meta sequenceId="1324"]
6065:clm/agent/clma_api.cc:538 >> saClmInitialize
<143>1 2018-09-14T18:53:31. [meta sequenceId="1325"]
6065:clm/agent/clma_api.cc:630 >> clmainitialize
<143>1 2018-09-14T18:53:31.954099+05:30 osafsmfd [meta sequenceId="1326"]
6065:clm/agent/clma_util.cc:99 >> clma_startup: clma_use_count: 0
<143>1 2018-09-14T18:53:31.954117+05:30 osafsmfd [meta sequenceId="1327"]
6065:clm/agent/clma_mds.cc:1116 >> clma_mds_init
<143>1 2018-09-14T18:53:31.95585+05:30 osafsmfd [meta sequenceId="1328"]
6065:clm/agent/clma_mds.cc:1164 << clma_mds_init
<143>1 2018-09-14T18:53:53.204569+05:30 osafsmfd [meta sequenceId="1329"]
6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2030f, svc version
2 role 1
<143>1 2018-09-14T18:54:01.774991+05:30 osafsmfd [meta sequenceId="1330"]
6070:smf/smfd/smfd_mds.c:294 TR SMFD SVC event 6 for nodeid 2020f, svc version
1 role 2
<143>1 2018-09-14T18:54:01.775047+05:30 osafsmfd [meta sequenceId="1331"]
6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2020f, svc version
2 role 1
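The last CLM trace above is the exit of clma_mds_init; there is no trace of leaving
clma_startup() or saClmInitialize(), so the MDS quiesced-ack path that would answer
AMF never gets a chance to run. Below is a minimal sketch of the deferred
quiesced-response pattern described in the analysis; the names smfd_cb_t,
stored_invocation and handle_mds_quiesced_ack are illustrative only, not the exact
smfd symbols.
~~~
/* Minimal sketch of the deferred quiesced-response pattern described above.
 * smfd_cb_t, stored_invocation and handle_mds_quiesced_ack are illustrative
 * names, not the exact smfd symbols. */
#include <saAmf.h>

typedef struct {
	SaAmfHandleT amf_hdl;
	SaInvocationT stored_invocation; /* kept until the MDS ack arrives */
} smfd_cb_t;

static smfd_cb_t cb;

/* CSI set callback: for QUIESCED, remember the invocation, request the MDS
 * role change and return without answering AMF yet. */
static void csi_set_sketch(SaInvocationT inv, SaAmfHAStateT new_haState)
{
	if (new_haState == SA_AMF_HA_QUIESCED) {
		cb.stored_invocation = inv; /* respond later */
		/* smfd_mds_change_role() would be requested here */
		return;
	}
	saAmfResponse(cb.amf_hdl, inv, SA_AIS_OK);
}

/* Runs only when the MDS quiesced ack arrives; this is the only place the
 * deferred AMF response is sent. If the process is stuck in
 * saClmInitialize(), this never executes and the callback times out. */
static void handle_mds_quiesced_ack(void)
{
	saAmfResponse(cb.amf_hdl, cb.stored_invocation, SA_AIS_OK);
	cb.stored_invocation = 0;
}
~~~
With this pattern, any code path that blocks the process before the quiesced ack is
handled (here, the SMFND UP handling blocking in saClmInitialize()) directly turns
into a csiSetcallbackTimeout.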
---
** [tickets:#2061] smfd faulted on Active controller due to
csiSetcallbackTimeout during si-swap operation**
**Status:** assigned
**Milestone:** 5.18.09
**Created:** Thu Sep 22, 2016 09:26 AM UTC by Ritu Raj
**Last Updated:** Fri Sep 14, 2018 02:13 PM UTC
**Owner:** Krishna Pawar
**Attachments:**
-
[SC-1.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2061/attachment/SC-1.tar.bz2)
(178.0 kB; application/x-bzip)
-
[SC-2.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2061/attachment/SC-2.tar.bz2)
(206.3 kB; application/x-bzip)
# Environment details
OS: SUSE 64-bit
Changeset: 7997 (5.1.FC)
Setup: 3 nodes (2 controllers and 1 payload), headless feature disabled, 1PBE
with 100K objects
# Summary
smfd faulted on the Active (previously Active) controller due to
csiSetcallbackTimeout during an si-swap operation
# Steps followed & Observed behaviour
1. Initiated an si-swap operation from the Active controller and simultaneously
killed osafsmfnd on the Standby and osafckptnd on the payload (PL-3)
2. Observed that, during the role change, smfd faulted on the Active controller
From the traces, it is observed that:
**In the file "osaf/services/saf/smfsv/smfd/smfd_smfnd.c" there is no TRY_AGAIN
retry mechanism for the APIs below:**
~~~
	/* Find Clm info about the node */
	rc = saClmInitialize(&clmHandle, NULL, &clmVersion);
	if (rc != SA_AIS_OK) {
		LOG_ER("saClmInitialize failed, rc=%s", saf_error(rc));
		if (newNode) free(smfnd);
		pthread_mutex_unlock(&smfnd_list_lock);
		return NCSCC_RC_FAILURE;
	}

	/* Get Clm info about the node */
	SaClmClusterNodeT clmInfo;
	rc = saClmClusterNodeGet(clmHandle, i_node_id, 10000000000LL, &clmInfo);
	if (rc != SA_AIS_OK) {
		LOG_ER("saClmClusterNodeGet failed, rc=%s", saf_error(rc));
		if (newNode) free(smfnd);
		rc = saClmFinalize(clmHandle);
		if (rc != SA_AIS_OK) {
			LOG_ER("saClmFinalize failed, rc=%s", saf_error(rc));
		}
		pthread_mutex_unlock(&smfnd_list_lock);
		return NCSCC_RC_FAILURE;
	}
~~~
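The syslog further below also shows saClmInitialize() failing with
SA_AIS_ERR_TRY_AGAIN. A bounded retry is one way to handle that; the following is
only a sketch of the idea (the retry count and sleep interval are arbitrary example
values), not a committed fix:
~~~
/* Sketch only: a bounded TRY_AGAIN retry around saClmInitialize().
 * The retry count and sleep interval are arbitrary example values. */
#include <unistd.h>
#include <saClm.h>

static SaAisErrorT clm_init_with_retry(SaClmHandleT *clmHandle,
				       SaVersionT *clmVersion)
{
	SaAisErrorT rc;
	int retries = 10;

	do {
		rc = saClmInitialize(clmHandle, NULL, clmVersion);
		if (rc != SA_AIS_ERR_TRY_AGAIN)
			break;
		usleep(500 * 1000); /* wait 500 ms before retrying */
	} while (--retries > 0);

	return rc;
}
~~~
Note that such a retry only covers the TRY_AGAIN failure seen in the syslog; it
does not by itself resolve the blocking saClmInitialize() call described in the
analysis at the top of this comment.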
**Syslog:**
Sep 22 14:15:05 fos1 osafimmnd[6164]: NO Implementer disconnected 17 <0,
2030f(down)> (MsgQueueService131855)
Sep 22 14:15:08 fos1 osafamfnd[6253]: NO
'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Sep 22 14:15:08 fos1 osafamfnd[6253]: ER
safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:csiSetcallbackTimeout Recovery is:nodeFailfast
Sep 22 14:15:08 fos1 osafamfnd[6253]: Rebooting OpenSAF NodeId = 131343 EE Name
= , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343,
SupervisionTime = 60
Sep 22 14:15:08 fos1 opensaf_reboot: Rebooting local node; timeout=60
Sep 22 14:15:09 fos1 osafsmfd[6272]: ER saClmInitialize failed,
rc=SA_AIS_ERR_TRY_AGAIN (6)
Sep 22 14:15:09 fos1 osafsmfd[6272]: WA proc_mds_info: SMFND UP failed
**Notes:**
1. Syslogs of both controllers attached
2. smfd traces attached