Hi Mahesh,

The cluster should not reboot. At most, it should trigger loading of data from PBE. I guess you are testing with only one payload. Please test with at least 2 payloads; then you should not see loading from PBE either.
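A rough way to picture this is the toy C model below. All names in it are made up for illustration and nothing here is OpenSAF code. It only assumes what the patch itself does: the coordinator IMMND exits when IMMD goes down, so recovery by sync needs at least one other IMMND that still holds the IMM data.

```c
#include <stdbool.h>

/* Toy model only: these names are hypothetical and do not exist in OpenSAF. */
typedef enum { RECOVER_BY_SYNC, RECOVER_BY_PBE_LOAD } recovery_t;

/*
 * payloads:         number of payload nodes running IMMND when the SCs go down
 * coord_on_payload: true if the IMMND coordinator currently runs on a payload
 */
recovery_t recovery_after_headless(unsigned payloads, bool coord_on_payload)
{
    /* Every payload IMMND starts out as a veteran holding the IMM data. */
    unsigned veterans = payloads;

    /* The coordinator IMMND exits on IMMD down, so it is no longer a veteran. */
    if (coord_on_payload && veterans > 0)
        veterans--;

    /* Sync needs at least one veteran; otherwise data is loaded from PBE. */
    return veterans > 0 ? RECOVER_BY_SYNC : RECOVER_BY_PBE_LOAD;
}
```

In this model a single payload that is also the coordinator gives RECOVER_BY_PBE_LOAD, while two payloads give RECOVER_BY_SYNC.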
This 2-payload requirement is a limitation in IMM, caused by the restart of the IMM coordinator process when the cluster goes headless. When the IMM coordinator is on a payload (usually after the first headless state) and that is the only payload in the cluster, then once IMMND on that payload is restarted, no veteran node holding IMM data remains, so sync is not possible.

If you really do get a cluster reboot, then it seems that you are missing the AMF patches for cloud resilience.

BR,
Zoran

From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Friday, February 19, 2016 6:14 AM
To: opensaf-devel@lists.sourceforge.net; Zoran Milinkovic; Neelakanta Reddy
Subject: Re: [devel] [PATCH 4 of 5] imm: add IMMND support for cloud resilience feature [#1625]

Hi Zoran/Neel,

It seems there is an issue with IMM: if we restart both SCs more than once (two headless states), the cluster goes for a reboot.

================================================================================
Feb 19 10:35:37 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:38 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:38 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:39 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:39 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:40 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:40 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:41 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:41 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:42 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:42 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:42 PL-3 kernel: [ 2259.813009] tipc: Established link <1.1.3:eth1-1.1.2:eth2> on network plane B
Feb 19 10:35:42 PL-3 kernel: [ 2259.814115] tipc: Established link <1.1.3:eth4-1.1.2:eth3> on network plane A
Feb 19 10:35:43 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:43 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:44 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:44 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:45 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:45 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:46 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:46 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:47 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:47 PL-3 osafimmnd[5076]: WA MDS Send Failed to service:IMMD rc:2
Feb 19 10:35:47 PL-3 osafimmnd[5076]: NO IMMD service is UP ... ScAbsenseAllowed?:900 introduced?:2
Feb 19 10:35:47 PL-3 osafimmnd[5076]: NO Re-introduce-me highestProcessed:3834 highestReceived:3834
Feb 19 10:35:47 PL-3 osafimmnd[5076]: NO This IMMND is now the NEW Coord
Feb 19 10:35:47 PL-3 osafimmnd[5076]: NO SETTING COORD TO 1 CLOUD PROTO
Feb 19 10:35:47 PL-3 osafimmnd[5076]: NO This IMMND re-elected coord redundantly, failover ?
Feb 19 10:35:48 PL-3 osafimmnd[5076]: NO Announce sync, epoch:6
Feb 19 10:35:48 PL-3 osafimmnd[5076]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Feb 19 10:35:48 PL-3 osafimmnd[5076]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Feb 19 10:35:49 PL-3 osafimmloadd: NO Sync starting
Feb 19 10:35:49 PL-3 osafimmloadd: IN Synced 449 objects in total
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO Epoch set to 6 in ImmModel
Feb 19 10:35:49 PL-3 cpsv_app: IN Received PROC_STALE_CLIENTS
Feb 19 10:35:49 PL-3 osafimmloadd: NO Sync ending normally
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO Implementer connected: 24 (MsgQueueService131855) <51, 2030f>
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO Implementer connected: 25 (MsgQueueService132111) <0, 2040f>
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO Implementer connected: 26 (safLogService) <0, 2020f>
Feb 19 10:35:49 PL-3 osafckptnd[5115]: NO Bad CLM handle. Reinitializing.
Feb 19 10:35:49 PL-3 osafmsgnd[5133]: ER saClmDispatch Failed with error 9
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO Implementer connected: 27 (safClmService) <0, 2020f>
Feb 19 10:35:49 PL-3 osafclmna[5086]: NO safNode=PL-3,safCluster=myClmCluster Joined cluster, nodeid=2030f
Feb 19 10:35:49 PL-3 osafamfnd[5095]: NO AVD NEW_ACTIVE, adest:1
Feb 19 10:35:49 PL-3 osafamfnd[5095]: NO Sending node up due to NCSMDS_NEW_ACTIVE
Feb 19 10:35:49 PL-3 osafamfnd[5095]: NO 1 SISU states sent
Feb 19 10:35:49 PL-3 osafamfnd[5095]: NO 1 SU states sent
Feb 19 10:35:49 PL-3 osafamfnd[5095]: NO 7 CSICOMP states synced
Feb 19 10:35:49 PL-3 osafamfnd[5095]: NO 7 SU states sent
Feb 19 10:35:49 PL-3 osafimmnd[5076]: NO Implementer connected: 28 (safAmfService) <0, 2020f>
Feb 19 10:35:58 PL-3 kernel: [ 2275.708728] tipc: Established link <1.1.3:eth1-1.1.1:eth1> on network plane B
Feb 19 10:35:58 PL-3 kernel: [ 2275.710195] tipc: Established link <1.1.3:eth4-1.1.1:eth0> on network plane A
Feb 19 10:36:00 PL-3 osafimmnd[5076]: NO Implementer connected: 29 (MsgQueueService131599) <0, 2020f>
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer connected: 30 (safCheckPointService) <0, 2020f>
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer disconnected 30 <0, 2020f> (safCheckPointService)
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer connected: 31 (safMsgGrpService) <0, 2020f>
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer connected: 32 (safCheckPointService) <0, 2020f>
Feb 19 10:36:01 PL-3 osafckptnd[5115]: NO cpnd_proc_update_cpd_data::ckpt_name = safCkpt=all_collocated_ckpt_name_101[360]
Feb 19 10:36:01 PL-3 osafckptnd[5115]: NO cpnd_proc_update_cpd_data::send CPD_EVT_ND2D_CKPT_INFO_UPDATE
Feb 19 10:36:01 PL-3 osafckptnd[5115]: NO cpnd_proc_update_cpd_data::CPND_EVT_D2ND_CKPT_INFO_UPDATE_ACK received
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer connected: 33 (safLckService) <0, 2020f>
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer connected: 34 (safEvtService) <0, 2020f>
Feb 19 10:36:01 PL-3 osafimmnd[5076]: NO Implementer connected: 35 (safSmfService) <0, 2020f>
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO Announce sync, epoch:7
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Feb 19 10:36:02 PL-3 osafimmloadd: NO Sync starting
Feb 19 10:36:02 PL-3 osafimmloadd: IN Synced 447 objects in total
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO Epoch set to 7 in ImmModel
Feb 19 10:36:02 PL-3 osafimmloadd: NO Sync ending normally
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 10:36:02 PL-3 osafimmnd[5076]: NO Implementer (applier) connected: 36 (@safAmfService2010f) <0, 2010f>
Feb 19 10:36:05 PL-3 osafimmnd[5076]: NO Implementer connected: 37 (MsgQueueService131343) <0, 2010f>
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 31 <0, 2020f> (safMsgGrpService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: WA DISCARD DUPLICATE FEVS message:5198
Feb 19 10:36:36 PL-3 osafimmnd[5076]: WA Error code 2 returned for message type 82 - ignoring
Feb 19 10:36:36 PL-3 osafimmnd[5076]: WA DISCARD DUPLICATE FEVS message:5199
Feb 19 10:36:36 PL-3 osafimmnd[5076]: WA Error code 2 returned for message type 82 - ignoring
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Global discard node received for nodeId:2020f pid:8958
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 29 <0, 2020f(down)> (MsgQueueService131599)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 35 <0, 2020f(down)> (safSmfService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 34 <0, 2020f(down)> (safEvtService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 33 <0, 2020f(down)> (safLckService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 32 <0, 2020f(down)> (safCheckPointService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 28 <0, 2020f(down)> (safAmfService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 27 <0, 2020f(down)> (safClmService)
Feb 19 10:36:36 PL-3 osafimmnd[5076]: NO Implementer disconnected 26 <0, 2020f(down)> (safLogService)
Feb 19 10:36:37 PL-3 kernel: [ 2314.221860] tipc: Resetting link <1.1.3:eth4-1.1.2:eth3>, changeover initiated by peer
Feb 19 10:36:37 PL-3 kernel: [ 2314.221867] tipc: Lost link <1.1.3:eth4-1.1.2:eth3> on network plane A
Feb 19 10:36:39 PL-3 kernel: [ 2316.512074] tipc: Resetting link <1.1.3:eth1-1.1.2:eth2>, peer not responding
Feb 19 10:36:39 PL-3 kernel: [ 2316.512080] tipc: Lost link <1.1.3:eth1-1.1.2:eth2> on network plane B
Feb 19 10:36:39 PL-3 kernel: [ 2316.512082] tipc: Lost contact with <1.1.2>
Feb 19 10:36:39 PL-3 osafamfnd[5095]: NO AVD NEW_ACTIVE, adest:1
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer disconnected 36 <0, 2010f> (@safAmfService2010f)
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 38 (safAmfService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 39 (safLogService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 40 (safMsgGrpService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 41 (safCheckPointService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 42 (MsgQueueService131599) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 43 (safLckService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 44 (safClmService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 45 (safEvtService) <0, 2010f>
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer disconnected 42 <0, 2010f> (MsgQueueService131599)
Feb 19 10:36:39 PL-3 osafimmnd[5076]: NO Implementer connected: 46 (safSmfService) <0, 2010f>
Feb 19 10:36:43 PL-3 osafimmnd[5076]: NO Implementer disconnected 40 <0, 2010f> (safMsgGrpService)
Feb 19 10:36:43 PL-3 osafimmnd[5076]: NO Implementer disconnected 41 <0, 2010f> (safCheckPointService)
Feb 19 10:36:43 PL-3 osafimmnd[5076]: NO Implementer disconnected 37 <0, 2010f> (MsgQueueService131343)
Feb 19 10:36:43 PL-3 osafimmnd[5076]: WA SC Absence IS allowed:900 IMMD service is DOWN
Feb 19 10:36:43 PL-3 osafimmnd[5076]: WA This IMMND coord has to exit allowing restarted IMMD to select new coord
Feb 19 10:36:43 PL-3 osafamfnd[5095]: NO 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
Feb 19 10:36:43 PL-3 osafamfnd[5095]: NO Restarting a component of 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
Feb 19 10:36:43 PL-3 osafamfnd[5095]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Feb 19 10:36:43 PL-3 osafimmnd[5605]: Started
Feb 19 10:36:44 PL-3 osafamfnd[5095]: WA AMF director unexpectedly crashed
Feb 19 10:36:44 PL-3 osafamfnd[5095]: NO Checking 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages
================================================================================

-AVM

On 2/4/2016 3:11 PM, Hung Nguyen wrote:

Hi Zoran,

Please find my comment inline.
BR,
Hung Nguyen - DEK Technologies

--------------------------------------------------------------------------------
From: Zoran Milinkovic zoran.milinko...@ericsson.com
Sent: Tuesday, December 22, 2015 9:14PM
To: Neelakanta Reddy reddy.neelaka...@oracle.com
Cc: Opensaf-devel opensaf-devel@lists.sourceforge.net
Subject: [devel] [PATCH 4 of 5] imm: add IMMND support for cloud resilience feature [#1625]

 osaf/services/saf/immsv/immnd/ImmModel.cc  | 115 ++++++++++++++++++++
 osaf/services/saf/immsv/immnd/ImmModel.hh  |   9 +-
 osaf/services/saf/immsv/immnd/immnd_cb.h   |  11 +-
 osaf/services/saf/immsv/immnd/immnd_evt.c  | 166 ++++++++++++++++++++++++----
 osaf/services/saf/immsv/immnd/immnd_init.h |  13 ++-
 osaf/services/saf/immsv/immnd/immnd_main.c |   7 +
 osaf/services/saf/immsv/immnd/immnd_proc.c | 120 ++++++++++++++++----
 7 files changed, 381 insertions(+), 60 deletions(-)

The patch contains IMMND code that is needed for supporting cloud resilience feature.

diff --git a/osaf/services/saf/immsv/immnd/ImmModel.cc b/osaf/services/saf/immsv/immnd/ImmModel.cc --- a/osaf/services/saf/immsv/immnd/ImmModel.cc +++ b/osaf/services/saf/immsv/immnd/ImmModel.cc @@ -446,6 +446,7 @@ static const std::string immPbeBSlaveNam static const std::string immLongDnsAllowed(OPENSAF_IMM_LONG_DNS_ALLOWED); static const std::string immAccessControlMode(OPENSAF_IMM_ACCESS_CONTROL_MODE); static const std::string immAuthorizedGroup(OPENSAF_IMM_AUTHORIZED_GROUP); +static const std::string immScAbsenceAllowed(OPENSAF_IMM_SC_ABSENCE_ALLOWED); static const std::string immMngtClass("SaImmMngt"); static const std::string immManagementDn("safRdn=immManagement,safApp=safImmService"); @@ -492,6 +493,17 @@ struct CcbIdIs }; +void +immModel_setScAbsenceAllowed(IMMND_CB *cb) +{ + if(cb->mCanBeCoord == 4) { + osafassert(cb->mScAbsenceAllowed > 0); + } else { + osafassert(cb->mScAbsenceAllowed == 0); + } + ImmModel::instance(&cb->immModel)->setScAbsenceAllowed(cb->mScAbsenceAllowed); +} + SaAisErrorT immModel_ccbResult(IMMND_CB *cb, SaUint32T ccbId) { @@ -511,6 +523,32 @@ immModel_abortSync(IMMND_CB *cb) } void +immModel_isolateThisNode(IMMND_CB *cb) +{ + ImmModel::instance(&cb->immModel)->isolateThisNode(cb->node_id, cb->mIsCoord); +} + +void +immModel_abortNonCriticalCcbs(IMMND_CB *cb) +{ + SaUint32T arrSize; + SaUint32T* implConnArr = NULL; + SaUint32T client; + SaClmNodeIdT pbeNodeId; + SaUint32T nodeId; + CcbVector::iterator i3 = sCcbVector.begin(); + for(; i3!=sCcbVector.end(); ++i3) { + if((*i3)->mState < IMM_CCB_CRITICAL) { + osafassert(immModel_ccbAbort(cb, (*i3)->mId, &arrSize, &implConnArr, &client, &nodeId, &pbeNodeId)); + osafassert(immModel_ccbFinalize(cb, (*i3)->mId) == SA_AIS_OK); + if (arrSize) { + free(implConnArr); + } + } + } +} + +void immModel_pbePrtoPurgeMutations(IMMND_CB *cb, SaUint32T nodeId, SaUint32T *reqArrSize, SaUint32T **reqConnArr) { @@ -17171,6 +17209,27 @@ ImmModel::getParentDn(std::string& paren TRACE_LEAVE(); } +void 
+ImmModel::setScAbsenceAllowed(SaUint16T scAbsenceAllowed) +{ + ObjectMap::iterator oi = sObjectMap.find(immObjectDn); + osafassert(oi != sObjectMap.end()); + ObjectInfo* immObject = oi->second; + ImmAttrValueMap::iterator avi = + immObject->mAttrValueMap.find(immScAbsenceAllowed); + if(avi == immObject->mAttrValueMap.end()) { + LOG_WA("Attribue '%s' does not exist in object '%s'", + immScAbsenceAllowed.c_str(), immObjectDn.c_str()); + return; + } + + osafassert(!(avi->second->isMultiValued())); + ImmAttrValue* valuep = (ImmAttrValue *) avi->second; + valuep->setValue_int(scAbsenceAllowed); + + LOG_NO("ABT ImmModel received scAbsenceAllowed %u", scAbsenceAllowed); +} + SaAisErrorT ImmModel::finalizeSync(ImmsvOmFinalizeSync* req, bool isCoord, bool isSyncClient) @@ -18067,3 +18126,59 @@ ImmModel::finalizeSync(ImmsvOmFinalizeSy return err; } +void +ImmModel::isolateThisNode(unsigned int thisNode, bool isAtCoord) +{ + /* Move this logic up to immModel_isolate... No need for this extra level. + But need to abort and terminate ccbs. + */ + ImplementerVector::iterator i; + AdminOwnerVector::iterator i2; + CcbVector::iterator i3; + unsigned int otherNode; + + if((sImmNodeState != IMM_NODE_FULLY_AVAILABLE) && (sImmNodeState != IMM_NODE_R_AVAILABLE)) { + LOG_NO("SC abscence interrupted sync of this IMMND - exiting"); + exit(0); + } + + i = sImplementerVector.begin(); + while(i != sImplementerVector.end()) { + IdVector cv, gv; + ImplementerInfo* info = (*i); + otherNode = info->mNodeId; + if(otherNode == thisNode || otherNode == 0) { + i++; + } else { + info = NULL; + this->discardNode(otherNode, cv, gv, isAtCoord); + LOG_NO("Impl Discarded node %x", otherNode); + /* Discard ccbs. */ + + i = sImplementerVector.begin(); /* restart iteration. */ + } + } + + i2 = sOwnerVector.begin(); + while(i2 != sOwnerVector.end()) { + IdVector cv, gv; + AdminOwnerInfo* ainfo = (*i2); + otherNode = ainfo->mNodeId; + if(otherNode == thisNode || otherNode == 0) { + /* ??? 
(otherNode == 0) is that really correct ??? */ + i2++; + } else { + ainfo = NULL; + this->discardNode(otherNode, cv, gv, isAtCoord); + LOG_NO("Admo Discarded node %x", otherNode); + /* Discard ccbs */ + + i2 = sOwnerVector.begin(); /* restart iteration. */ + } + } + + /* Verify that all noncritical CCBs are aborted. + Ccbs where client resided at this node chould already have been handled in + immnd_proc_discard_other_nodes() that calls immnd_proc_imma_discard_connection() + */ +} diff --git a/osaf/services/saf/immsv/immnd/ImmModel.hh b/osaf/services/saf/immsv/immnd/ImmModel.hh --- a/osaf/services/saf/immsv/immnd/ImmModel.hh +++ b/osaf/services/saf/immsv/immnd/ImmModel.hh @@ -145,12 +145,6 @@ public: const immsv_octet_string* clName, ImmsvOmClassDescr* res); - SaAisErrorT classSerialize( - const char* className, - char** data, - size_t* size); - - SaAisErrorT attrCreate( ClassInfo* classInfo, const ImmsvAttrDefinition* attr, @@ -480,6 +474,8 @@ public: const struct ImmsvAdminOperationParam *reqparams, struct ImmsvAdminOperationParam **rparams, SaUint64T searchcount); + + void setScAbsenceAllowed(SaUint16T scAbsenceAllowed); SaAisErrorT objectSync(const ImmsvOmObjectSync* req); bool fetchRtUpdate(ImmsvOmObjectSync* syncReq, @@ -517,6 +513,7 @@ public: void recognizedIsolated(); bool syncComplete(bool isJoining); void abortSync(); + void isolateThisNode(unsigned int thisNode, bool isAtCoord); void pbePrtoPurgeMutations(unsigned int nodeId, ConnVector& connVector); SaAisErrorT ccbResult(SaUint32T ccbId); ImmsvAttrNameList * ccbGrabErrStrings(SaUint32T ccbId); diff --git a/osaf/services/saf/immsv/immnd/immnd_cb.h b/osaf/services/saf/immsv/immnd/immnd_cb.h --- a/osaf/services/saf/immsv/immnd/immnd_cb.h +++ b/osaf/services/saf/immsv/immnd/immnd_cb.h @@ -113,13 +113,17 @@ typedef struct immnd_cb_tag { SaUint32T mMyEpoch; //Epoch counter, used in synch of immnds SaUint32T mMyPid; //Is this needed ?? 
SaUint32T mRulingEpoch; - uint8_t mAccepted; //Should all fevs messages be processed? + SaUint32T mLatestAdmoId; + SaUint32T mLatestImplId; + SaUint32T mLatestCcbId; + + uint8_t mAccepted; //If=!0 Fevs messages can be processed. 2=>IMMD re-introduce. uint8_t mIntroduced; //Ack received on introduce message uint8_t mSyncRequested; //true=> I am coord, other req sync uint8_t mPendSync; //1=>sync announced but not received. uint8_t mSyncFinalizing; //1=>finalizeSync sent but not received. uint8_t mSync; //true => this node is being synced (client). - uint8_t mCanBeCoord; //If!=0 then SC, if 2 the 2pbe arbitration. + uint8_t mCanBeCoord; //If!=0 then SC, 2 => 2pbe arbitration, 4 => absentScAllowed. uint8_t mIsCoord; uint8_t mLostNodes; //Detached & not syncreq => delay sync start uint8_t mBlockPbeEnable; //Current PBE has not completed shutdown yet. @@ -128,6 +132,8 @@ typedef struct immnd_cb_tag { bool mIsOtherScUp; //If set & this is an SC then other SC is up(2pbe). //False=> *allow* 1safe 2pbe. May err conservatively (true) bool mForceClean; //true => Force cleanTheHouse to run once *now*. + SaUint16T mScAbsenceAllowed; /* Non zero if "headless Hydra" allowed (loss of both IMMDs/SCs). + Value is number of seconds of SC absence tolerated. */ /* Information about the IMMD */ MDS_DEST immd_mdest_id; @@ -161,6 +167,7 @@ typedef struct immnd_cb_tag { uint8_t mPbeVeteran; //false => regenerate. true => re-attach db-file uint8_t mPbeVeteranB; //false => regenerate. true => re-attach db-file uint8_t mPbeOldVeteranB; //false => restarted, true => stable. (only to reduce logging). + uint8_t mPbeUsesSharedFs; //false => not use SFS, true => use SFS SaAmfHAStateT ha_state; // present AMF HA state of the component EDU_HDL immnd_edu_hdl; // edu handle, obscurely needed by mds. 
diff --git a/osaf/services/saf/immsv/immnd/immnd_evt.c b/osaf/services/saf/immsv/immnd/immnd_evt.c --- a/osaf/services/saf/immsv/immnd/immnd_evt.c +++ b/osaf/services/saf/immsv/immnd/immnd_evt.c @@ -75,9 +75,9 @@ static void immnd_evt_proc_admo_finalize IMMND_EVT *evt, SaBoolT originatedAtThisNd, SaImmHandleT clnt_hdl, MDS_DEST reply_dest); -static void immnd_evt_proc_admo_hard_finalize(IMMND_CB *cb, - IMMND_EVT *evt, - SaBoolT originatedAtThisNd, SaImmHandleT clnt_hdl, MDS_DEST reply_dest); +//static void immnd_evt_proc_admo_hard_finalize(IMMND_CB *cb, +// IMMND_EVT *evt, +// SaBoolT originatedAtThisNd, SaImmHandleT clnt_hdl, MDS_DEST reply_dest); static void immnd_evt_proc_admo_set(IMMND_CB *cb, IMMND_EVT *evt, @@ -1515,7 +1515,7 @@ static uint32_t immnd_evt_proc_search_ne on a previous syncronous call. Discard the connection and return BAD_HANDLE to allow client to recover and make progress. */ - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); osafassert(rc == NCSCC_RC_SUCCESS); free(cl_node); @@ -1973,7 +1973,7 @@ static uint32_t immnd_evt_proc_imm_final goto agent_rsp; } - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); if (rc == NCSCC_RC_FAILURE) { @@ -2197,9 +2197,11 @@ static uint32_t immnd_evt_proc_imm_clien cl_node->mIsResurrect = 0x1; if (immnd_client_node_add(cb, cl_node) != NCSCC_RC_SUCCESS) { +#if 0 //CLOUD-PROTO ABT clients should be discarded !!!! LOG_ER("IMMND - Adding temporary imma client Failed."); /*free(cl_node);*/ abort(); +#endif } TRACE_2("Added client with id: %llx <node:%x, count:%u>", @@ -2314,7 +2316,7 @@ static uint32_t immnd_evt_proc_admowner_ on a previous syncronous call. Discard the connection and return BAD_HANDLE to allow client to recover and make progress. 
*/ - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); osafassert(rc == NCSCC_RC_SUCCESS); free(cl_node); @@ -2442,7 +2444,7 @@ static uint32_t immnd_evt_proc_impl_set( on a previous syncronous call. Discard the connection and return BAD_HANDLE to allow client to recover and make progress. */ - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); osafassert(rc == NCSCC_RC_SUCCESS); free(cl_node); @@ -2573,7 +2575,7 @@ static uint32_t immnd_evt_proc_ccb_init( on a previous syncronous call. Discard the connection and return BAD_HANDLE to allow client to recover and make progress. */ - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); osafassert(rc == NCSCC_RC_SUCCESS); free(cl_node); @@ -2680,7 +2682,7 @@ static uint32_t immnd_evt_proc_rt_update on a previous syncronous call. Discard the connection and return BAD_HANDLE to allow client to recover and make progress. */ - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); osafassert(rc == NCSCC_RC_SUCCESS); free(cl_node); @@ -2866,7 +2868,7 @@ static uint32_t immnd_evt_proc_fevs_forw on a previous syncronous call. Discard the connection and return BAD_HANDLE to allow client to recover and make progress. 
*/ - immnd_proc_imma_discard_connection(cb, cl_node); + immnd_proc_imma_discard_connection(cb, cl_node, false); rc = immnd_client_node_del(cb, cl_node); osafassert(rc == NCSCC_RC_SUCCESS); free(cl_node); @@ -8317,7 +8319,7 @@ uint32_t immnd_evt_proc_abort_sync(IMMND if (cb->mState == IMM_SERVER_SYNC_CLIENT || cb->mState == IMM_SERVER_SYNC_PENDING) { /* Sync client will have to restart the sync */ cb->mState = IMM_SERVER_LOADING_PENDING; - LOG_WA("SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM SERVER LOADING PENDING (sync aborted)"); + LOG_WA("SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM_SERVER_LOADING_PENDING (sync aborted)"); cb->mStep = 0; cb->mJobStart = time(NULL); osafassert(cb->mJobStart >= ((time_t) 0)); @@ -8451,6 +8453,7 @@ static uint32_t immnd_evt_proc_start_syn with respect to the just arriving start-sync. Search for "ticket:#598" in immnd_proc.c */ + immModel_setScAbsenceAllowed(cb); } else if ((cb->mState == IMM_SERVER_SYNC_CLIENT) && (immnd_syncComplete(cb, SA_FALSE, cb->mStep))) { cb->mStep = 0; cb->mJobStart = time(NULL); @@ -8467,6 +8470,7 @@ static uint32_t immnd_evt_proc_start_syn with respect to the just arriving start-sync. 
Search for "ticket:#599" in immnd_proc.c */ + immModel_setScAbsenceAllowed(cb); } cb->mRulingEpoch = evt->info.ctrl.rulingEpoch; @@ -8543,7 +8547,7 @@ static uint32_t immnd_evt_proc_start_syn static uint32_t immnd_evt_proc_reset(IMMND_CB *cb, IMMND_EVT *evt, IMMSV_SEND_INFO *sinfo) { TRACE_ENTER(); - if (cb->mIntroduced) { + if (cb->mIntroduced==1) { LOG_ER("IMMND forced to restart on order from IMMD, exiting"); if(cb->mState < IMM_SERVER_READY) { immnd_ackToNid(NCSCC_RC_FAILURE); @@ -8668,11 +8672,15 @@ static uint32_t immnd_evt_proc_intro_rsp evt->info.ctrl.nodeId != cb->node_id); cb->mNumNodes++; TRACE("immnd_evt_proc_intro_rsp cb->mNumNodes: %u", cb->mNumNodes); + LOG_IN("immnd_evt_proc_intro_rsp: epoch:%i rulingEpoch:%u", cb->mMyEpoch, evt->info.ctrl.rulingEpoch); + if(evt->info.ctrl.rulingEpoch > cb->mRulingEpoch) { + cb->mRulingEpoch = evt->info.ctrl.rulingEpoch; + } if (evt->info.ctrl.nodeId == cb->node_id) { /*This node was introduced to the IMM cluster */ uint8_t oldCanBeCoord = cb->mCanBeCoord; - cb->mIntroduced = true; + cb->mIntroduced = 1; if(evt->info.ctrl.canBeCoord == 3) { cb->m2Pbe = 1; evt->info.ctrl.canBeCoord = 1; @@ -8708,6 +8716,14 @@ static uint32_t immnd_evt_proc_intro_rsp ((oldCanBeCoord == 2)?"load":"sync")); } + if(cb->mCanBeCoord == 4) { + osafassert(!(cb->m2Pbe)); + cb->mScAbsenceAllowed = evt->info.ctrl.ndExecPid; + LOG_IN("ABT cb->mScAbsenceAllowed:%u evt->info.ctrl.ndExecPid:%u", cb->mScAbsenceAllowed, evt->info.ctrl.ndExecPid); + LOG_IN("SC_ABSENCE_ALLOWED (Headless Hydra) is configured for %u seconds. 
CanBeCoord:%u", + cb->mScAbsenceAllowed, cb->mCanBeCoord); + } + if (evt->info.ctrl.isCoord) { if (cb->mIsCoord) { LOG_NO("This IMMND re-elected coord redundantly, failover ?"); @@ -8733,7 +8749,14 @@ static uint32_t immnd_evt_proc_intro_rsp } } - cb->mIsCoord = evt->info.ctrl.isCoord; + if(cb->mIsCoord) { + if(!(evt->info.ctrl.isCoord)) { + LOG_NO("ABT CLOUD PROTO avoided canceling coord - SHOULD NOT GET HERE"); + } + } else { + LOG_NO("SETTING COORD TO %u CLOUD PROTO", evt->info.ctrl.isCoord); + cb->mIsCoord = evt->info.ctrl.isCoord; + } osafassert(!cb->mIsCoord || cb->mCanBeCoord); cb->mRulingEpoch = evt->info.ctrl.rulingEpoch; if (cb->mRulingEpoch) { @@ -8751,7 +8774,7 @@ static uint32_t immnd_evt_proc_intro_rsp */ if(cb->mCanBeCoord && evt->info.ctrl.canBeCoord) { - LOG_IN("Other SC node (%x) has been introduced", evt->info.ctrl.nodeId); + LOG_IN("Other %s IMMND node (%x) has been introduced", (cb->mScAbsenceAllowed)?"candidate coord":"SC", evt->info.ctrl.nodeId); cb->mIsOtherScUp = true; /* Prevents oneSafe2PBEAllowed from being turned on */ cb->other_sc_node_id = evt->info.ctrl.nodeId; @@ -9066,7 +9089,9 @@ static void immnd_evt_proc_adminit_rsp(I SaUint32T conn; SaUint32T ownerId = 0; - osafassert(evt); + /* Remember latest admo_id for IMMD recovery. */ + cb->mLatestAdmoId = evt->info.adminitGlobal.globalOwnerId; + conn = m_IMMSV_UNPACK_HANDLE_HIGH(clnt_hdl); nodeId = m_IMMSV_UNPACK_HANDLE_LOW(clnt_hdl); ownerId = evt->info.adminitGlobal.globalOwnerId; @@ -9231,6 +9256,45 @@ static void immnd_evt_proc_finalize_sync /*This adjust-epoch will persistify the new epoch for: veterans. */ immnd_adjustEpoch(cb, SA_TRUE); /* Will osafassert if immd is down. */ } + + if(cb->mScAbsenceAllowed) {/* Coord and veteran nodes. */ + IMMND_IMM_CLIENT_NODE *cl_node = NULL; + SaImmHandleT prev_hdl; + unsigned int count = 0; + IMMSV_EVT send_evt; + /* Sync completed for veteran & headless allowed => trigger active + resurrect. 
*/ + memset(&send_evt, '\0', sizeof(IMMSV_EVT)); + send_evt.type = IMMSV_EVT_TYPE_IMMA; + send_evt.info.imma.type = IMMA_EVT_ND2A_PROC_STALE_CLIENTS; + immnd_client_node_getnext(cb, 0, &cl_node); + while (cl_node) { + prev_hdl = cl_node->imm_app_hdl; + if(!(cl_node->mIsResurrect)) { + LOG_IN("Veteran node found active client id: %llx " + "version:%c %u %u, after sync.", + cl_node->imm_app_hdl, cl_node->version.releaseCode, + cl_node->version.majorVersion, + cl_node->version.minorVersion); + immnd_client_node_getnext(cb, prev_hdl, &cl_node); + continue; + } + /* Send resurrect message. */ + if (immnd_mds_msg_send(cb, cl_node->sv_id, + cl_node->agent_mds_dest, &send_evt)!=NCSCC_RC_SUCCESS) + { + LOG_WA("Failed to send active resurrect message"); + } + /* Remove the temporary client node. */ + immnd_client_node_del(cb, cl_node); + memset(cl_node, '\0', sizeof(IMMND_IMM_CLIENT_NODE)); + free(cl_node); + cl_node = NULL; + ++count; + immnd_client_node_getnext(cb, 0, &cl_node); + } + TRACE_2("Triggered %u active resurrects at veteran node", count); + } } done: @@ -9485,7 +9549,7 @@ static void immnd_evt_proc_admo_finalize * is to be sent (only relevant if * originatedAtThisNode is false). *****************************************************************************/ -static void immnd_evt_proc_admo_hard_finalize(IMMND_CB *cb, +void immnd_evt_proc_admo_hard_finalize(IMMND_CB *cb, IMMND_EVT *evt, SaBoolT originatedAtThisNd, SaImmHandleT clnt_hdl, MDS_DEST reply_dest) { @@ -9550,6 +9614,9 @@ static void immnd_evt_proc_impl_set_rsp( evt->info.implSet.oi_timeout = 0; } + /* Remember latest impl_id for IMMD recovery. */ + cb->mLatestImplId = evt->info.implSet.impl_id; + err = immModel_implementerSet(cb, &(evt->info.implSet.impl_name), (originatedAtThisNd) ? 
conn : 0, nodeId, implId, reply_dest, evt->info.implSet.oi_timeout, &discardImplementer); @@ -9934,6 +10001,9 @@ static void immnd_evt_proc_ccbinit_rsp(I nodeId = m_IMMSV_UNPACK_HANDLE_LOW(clnt_hdl); ccbId = evt->info.ccbinitGlobal.globalCcbId; + /* Remember latest ccb_id for IMMD recovery. */ + cb->mLatestCcbId = evt->info.ccbinitGlobal.globalCcbId; + err = immModel_ccbCreate(cb, evt->info.ccbinitGlobal.i.adminOwnerId, evt->info.ccbinitGlobal.i.ccbFlags, @@ -10053,12 +10123,61 @@ static uint32_t immnd_evt_proc_mds_evt(I immnd_proc_imma_down(cb, evt->info.mds_info.dest, evt->info.mds_info.svc_id); } else if ((evt->info.mds_info.change == NCSMDS_DOWN) && evt->info.mds_info.svc_id == NCSMDS_SVC_ID_IMMD) { /* Cluster is going down. */ - LOG_NO("No IMMD service => cluster restart, exiting"); - if(cb->mState < IMM_SERVER_SYNC_SERVER) { - immnd_ackToNid(NCSCC_RC_FAILURE); - } - exit(1); - + if(cb->mScAbsenceAllowed == 0) { + /* Regular (non Hydra) exit on IMMD DOWN. */ + LOG_ER("No IMMD service => cluster restart, exiting"); + if(cb->mState < IMM_SERVER_SYNC_SERVER) { + immnd_ackToNid(NCSCC_RC_FAILURE); + } + exit(1); + } else { /* SC ABSENCE ALLOWED */ + LOG_WA("SC Absence IS allowed:%u IMMD service is DOWN", cb->mScAbsenceAllowed); + if(cb->mIsCoord) { + /* Note that normally the coord will reside at SCs so this branch will + only be relevant if REPEATED toal scAbsence occurs. After SC absence + and subsequent return of SC, the coord will be elected at a payload. + That coord will be active untill restart of that payload.. + unless we add functionality for the payload coord to restart after + a few minutes .. ? + */ + LOG_WA("This IMMND coord has to exit allowing restarted IMMD to select new coord"); + if(cb->mState < IMM_SERVER_SYNC_SERVER) { + immnd_ackToNid(NCSCC_RC_FAILURE); + } + exit(1); + } else if(cb->mState <= IMM_SERVER_LOADING_PENDING) { + /* Reset state in payloads that had not joined. No need to restart. 
*/
+                LOG_IN("Resetting IMMND state from %u to IMM_SERVER_ANONYMOUS", cb->mState);
+                cb->mState = IMM_SERVER_ANONYMOUS;
+            } else if(cb->mState < IMM_SERVER_READY) {
+                LOG_WA("IMMND was being synced or loaded (%u), has to restart", cb->mState);
+                if(cb->mState < IMM_SERVER_SYNC_SERVER) {
+                    immnd_ackToNid(NCSCC_RC_FAILURE);
+                }
+                exit(1);
+            }
+        }
+        cb->mIntroduced = 2;
+        LOG_NO("IMMD SERVICE IS DOWN, HYDRA IS CONFIGURED => UNREGISTERING IMMND form MDS");
+        immnd_mds_unregister(cb);
+        /* Discard local clients ... */
+        immnd_proc_discard_other_nodes(cb); /* Isolate from the rest of cluster */
+        LOG_NO("MDS unregisterede. sleeping ...");
+        sleep(1);
+        LOG_NO("Sleep done registering IMMND with MDS");
+        rc = immnd_mds_register(immnd_cb);
+        if(rc == NCSCC_RC_SUCCESS) {
+            LOG_NO("SUCCESS IN REGISTERING IMMND WITH MDS");
+        } else {
+            LOG_ER("FAILURE IN REGISTERING IMMND WITH MDS - exiting");
+            exit(1);
+        }
+    } else if ((evt->info.mds_info.change == NCSMDS_UP) && (evt->info.mds_info.svc_id == NCSMDS_SVC_ID_IMMD)) {
+        LOG_NO("IMMD service is UP ... ScAbsenseAllowed?:%u introduced?:%u",
+            cb->mScAbsenceAllowed, cb->mIntroduced);
+        if((cb->mIntroduced==2) && (immnd_introduceMe(cb) != NCSCC_RC_SUCCESS)) {
+            LOG_WA("IMMND re-introduceMe after IMMD restart failed, will retry");
+        }
     } else if ((evt->info.mds_info.change == NCSMDS_UP) &&
         (evt->info.mds_info.svc_id == NCSMDS_SVC_ID_IMMA_OM || evt->info.mds_info.svc_id == NCSMDS_SVC_ID_IMMA_OM)) {
@@ -10073,7 +10192,6 @@ static uint32_t immnd_evt_proc_mds_evt(I
         TRACE_2("IMMD FAILOVER");
         /* The IMMD has failed over.
*/
         immnd_proc_imma_discard_stales(cb);
-
     } else if (evt->info.mds_info.svc_id == NCSMDS_SVC_ID_IMMND) {
         LOG_NO("MDS SERVICE EVENT OF TYPE IMMND!!");
     }
diff --git a/osaf/services/saf/immsv/immnd/immnd_init.h b/osaf/services/saf/immsv/immnd/immnd_init.h
--- a/osaf/services/saf/immsv/immnd/immnd_init.h
+++ b/osaf/services/saf/immsv/immnd/immnd_init.h
@@ -39,8 +39,10 @@ extern IMMND_CB *immnd_cb;

 /* file : - immnd_proc.c */

+void immnd_proc_discard_other_nodes(IMMND_CB *cb);
+
 void immnd_proc_imma_down(IMMND_CB *cb, MDS_DEST dest, NCSMDS_SVC_ID sv_id);
-uint32_t immnd_proc_imma_discard_connection(IMMND_CB *cb, IMMND_IMM_CLIENT_NODE *cl_node);
+uint32_t immnd_proc_imma_discard_connection(IMMND_CB *cb, IMMND_IMM_CLIENT_NODE *cl_node, bool scAbsenceAllowed);
 void immnd_proc_imma_discard_stales(IMMND_CB *cb);
 void immnd_cb_dump(void);
@@ -75,6 +77,10 @@ extern "C" {

     void immModel_abortSync(IMMND_CB *cb);

+    void immModel_isolateThisNode(IMMND_CB *cb);
+
+    void immModel_abortNonCriticalCcbs(IMMND_CB *cb);
+
     void immModel_pbePrtoPurgeMutations(IMMND_CB *cb, unsigned int nodeId, SaUint32T *reqArrSize,
         SaUint32T **reqConArr);
@@ -433,6 +439,8 @@ extern "C" {
         const char *errorString, ...);

+    void immModel_setScAbsenceAllowed(IMMND_CB *cb);
+
 #ifdef __cplusplus
 }
 #endif
@@ -471,6 +479,9 @@ uint32_t immnd_mds_get_handle(IMMND_CB *
 /* File : ---- immnd_evt.c */
 void immnd_process_evt(void);
 uint32_t immnd_evt_destroy(IMMSV_EVT *evt, SaBoolT onheap, uint32_t line);
+void immnd_evt_proc_admo_hard_finalize(IMMND_CB *cb, IMMND_EVT *evt,
+    SaBoolT originatedAtThisNd, SaImmHandleT clnt_hdl, MDS_DEST reply_dest);
+
 /* End : ---- immnd_evt.c */

 /* File : ---- immnd_proc.c */
diff --git a/osaf/services/saf/immsv/immnd/immnd_main.c b/osaf/services/saf/immsv/immnd/immnd_main.c
--- a/osaf/services/saf/immsv/immnd/immnd_main.c
+++ b/osaf/services/saf/immsv/immnd/immnd_main.c
@@ -169,6 +169,13 @@ static uint32_t immnd_initialize(char *p
             immnd_cb->mPbeFile);
     }

+    if ((envVar =
getenv("IMMSV_USE_SHARED_FS"))) {
+        int useSharedFs = atoi(envVar);
+        if(useSharedFs != 0) {
+            immnd_cb->mPbeUsesSharedFs = 1;
+        }
+    }
+
     immnd_cb->mRim = SA_IMM_INIT_FROM_FILE;
     immnd_cb->mPbeVeteran = SA_FALSE;
     immnd_cb->mPbeVeteranB = SA_FALSE;
diff --git a/osaf/services/saf/immsv/immnd/immnd_proc.c b/osaf/services/saf/immsv/immnd/immnd_proc.c
--- a/osaf/services/saf/immsv/immnd/immnd_proc.c
+++ b/osaf/services/saf/immsv/immnd/immnd_proc.c
@@ -34,6 +34,7 @@

 #include "immnd.h"
 #include "immsv_api.h"
+#include "immnd_init.h"

 static const char *loaderBase = "osafimmloadd";
 static const char *pbeBase = "osafimmpbed";
@@ -76,7 +77,7 @@ void immnd_proc_immd_down(IMMND_CB *cb)
  * Notes : Policy used for handling immd down is to blindly cleanup
  * :immnd_cb
  ****************************************************************************/
-uint32_t immnd_proc_imma_discard_connection(IMMND_CB *cb, IMMND_IMM_CLIENT_NODE *cl_node)
+uint32_t immnd_proc_imma_discard_connection(IMMND_CB *cb, IMMND_IMM_CLIENT_NODE *cl_node, bool scAbsence)
 {
     SaUint32T client_id;
     SaUint32T node_id;
@@ -129,7 +130,8 @@ uint32_t immnd_proc_imma_discard_connect
         send_evt.type = IMMSV_EVT_TYPE_IMMD;
         send_evt.info.immd.type = IMMD_EVT_ND2D_DISCARD_IMPL;
         send_evt.info.immd.info.impl_set.r.impl_id = implId;
-        if (immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id, &send_evt) != NCSCC_RC_SUCCESS) {
+
+        if (!scAbsence && immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id, &send_evt) != NCSCC_RC_SUCCESS) {
             if (immnd_is_immd_up(cb)) {
                 LOG_ER("Discard implementer failed for implId:%u "
                     "but IMMD is up !? - case not handled. Client will be orphanded", implId);
@@ -142,7 +144,8 @@ uint32_t immnd_proc_imma_discard_connect
         /*Discard the local implementer directly and redundantly to avoid
           race conditions using this implementer (ccb's causing abort upcalls).
*/
-        immModel_discardImplementer(cb, implId, SA_FALSE, NULL, NULL);
+        //immModel_discardImplementer(cb, implId, SA_FALSE, NULL, NULL);
+        immModel_discardImplementer(cb, implId, scAbsence, NULL, NULL);
     }

     if (cl_node->mIsStale) {
@@ -163,7 +166,7 @@ uint32_t immnd_proc_imma_discard_connect
         for (ix = 0; ix < arrSize && !(cl_node->mIsStale); ++ix) {
             send_evt.info.immd.info.ccbId = idArr[ix];
             TRACE_5("Discarding Ccb id:%u originating at dead connection: %u", idArr[ix], client_id);
-            if (immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id,
+            if (!scAbsence && immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id,

[Hung] We don't need this ...

                 &send_evt) != NCSCC_RC_SUCCESS) {
                 if (immnd_is_immd_up(cb)) {
                     LOG_ER("Failure to broadcast discard Ccb for ccbId:%u "
@@ -174,6 +177,8 @@ uint32_t immnd_proc_imma_discard_connect
                         "(immd down)- will retry later", idArr[ix]);
                 }
                 cl_node->mIsStale = true;
+            } else if(scAbsence) {
+                /* ABT TODO discard local ccbs ??*/

[Hung] ... and this. When 'scAbsence' is true, the code does not send out any message, so the loop has nothing to do. It is simpler and faster to skip the lookup instead, something like this:

*if (!scAbsence) immModel_getCcbIdsForOrigCon(cb, client_id, &arrSize, &idArr);*

'arrSize' is initialized to '0', so the 'if' block will not be entered.
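To make the suggestion above concrete, here is a minimal standalone sketch of the control flow. It is not the IMMND code: `lookup_ccb_ids()` and `send_discard_ccb()` are hypothetical stand-ins for `immModel_getCcbIdsForOrigCon()` and the `immnd_mds_msg_send()` broadcast, and the ids are made up. It only shows that guarding the lookup with `!scAbsence` leaves `arrSize` at 0, so neither the loop nor the per-iteration `!scAbsence` test is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for immModel_getCcbIdsForOrigCon(): hands back a
 * heap-allocated array of CCB ids owned by the given connection. */
static void lookup_ccb_ids(uint32_t client_id, uint32_t *arr_size, uint32_t **id_arr)
{
	(void)client_id;
	*id_arr = malloc(2 * sizeof(uint32_t));
	assert(*id_arr != NULL);
	(*id_arr)[0] = 11;	/* made-up ccb ids */
	(*id_arr)[1] = 12;
	*arr_size = 2;
}

/* Hypothetical stand-in for the discard-ccb broadcast towards IMMD. */
static bool send_discard_ccb(uint32_t ccb_id)
{
	(void)ccb_id;
	return true;
}

/* Returns how many discard messages were sent for this client. */
static uint32_t discard_ccbs_for_client(uint32_t client_id, bool sc_absence)
{
	uint32_t arr_size = 0;	/* stays 0 when the lookup is skipped */
	uint32_t *id_arr = NULL;
	uint32_t sent = 0;

	/* Skip the lookup entirely when IMMD is absent: nothing can be sent. */
	if (!sc_absence)
		lookup_ccb_ids(client_id, &arr_size, &id_arr);

	if (arr_size) {
		for (uint32_t ix = 0; ix < arr_size; ++ix) {
			if (send_discard_ccb(id_arr[ix]))
				++sent;
		}
		free(id_arr);
	}
	return sent;
}
```

With this shape, `discard_ccbs_for_client(id, true)` never allocates or iterates, and the `else if(scAbsence)` TODO branch has no reason to exist inside the loop.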
             }
         }
         free(idArr);
@@ -197,20 +202,29 @@ uint32_t immnd_proc_imma_discard_connect
         send_evt.type = IMMSV_EVT_TYPE_IMMD;
         send_evt.info.immd.type = IMMD_EVT_ND2D_ADMO_HARD_FINALIZE;
         for (ix = 0; ix < arrSize && !(cl_node->mIsStale); ++ix) {
-            send_evt.info.immd.info.admoId = idArr[ix];
             TRACE_5("Hard finalize of AdmOwner id:%u originating at "
                 "dead connection: %u", idArr[ix], client_id);
-            if (immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id,
+            if (scAbsence) {
+                SaImmHandleT clnt_hdl;
+                MDS_DEST reply_dest;
+                memset(&clnt_hdl, '\0', sizeof(SaImmHandleT));
+                memset(&reply_dest, '\0', sizeof(MDS_DEST));
+                send_evt.info.immnd.info.admFinReq.adm_owner_id = idArr[ix];
+                immnd_evt_proc_admo_hard_finalize(cb, &send_evt.info.immnd, false, clnt_hdl, reply_dest);
+            } else {
+                send_evt.info.immd.info.admoId = idArr[ix];
+                if(immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id,
                     &send_evt) != NCSCC_RC_SUCCESS) {
-                if (immnd_is_immd_up(cb)) {
-                    LOG_ER("Failure to broadcast discard admo0wner for ccbId:%u "
-                        "but IMMD is up !? - case not handled. Client will "
-                        "be orphanded", implId);
-                } else {
-                    LOG_WA("Failure to broadcast discard admowner for id:%u "
-                        "(immd down)- will retry later", idArr[ix]);
+                    if (immnd_is_immd_up(cb)) {
+                        LOG_ER("Failure to broadcast discard admo0wner for ccbId:%u "
+                            "but IMMD is up !? - case not handled. Client will "
+                            "be orphanded", implId);
+                    } else {
+                        LOG_WA("Failure to broadcast discard admowner for id:%u "
+                            "(immd down)- will retry later", idArr[ix]);
+                    }
+                    cl_node->mIsStale = true;
                 }
-                cl_node->mIsStale = true;
             }
         }
         free(idArr);
@@ -251,7 +265,7 @@ void immnd_proc_imma_down(IMMND_CB *cb,
         prev_hdl = cl_node->imm_app_hdl;
         if ((memcmp(&dest, &cl_node->agent_mds_dest, sizeof(MDS_DEST)) == 0) && sv_id == cl_node->sv_id) {
-            if (immnd_proc_imma_discard_connection(cb, cl_node)) {
+            if (immnd_proc_imma_discard_connection(cb, cl_node, false)) {
                 TRACE_5("Removing client id:%llx sv_id:%u", cl_node->imm_app_hdl, cl_node->sv_id);
                 immnd_client_node_del(cb, cl_node);
                 memset(cl_node, '\0', sizeof(IMMND_IMM_CLIENT_NODE));
@@ -300,7 +314,7 @@ void immnd_proc_imma_discard_stales(IMMN
         prev_hdl = cl_node->imm_app_hdl;
         if (cl_node->mIsStale) {
             cl_node->mIsStale = false;
-            if (immnd_proc_imma_discard_connection(cb, cl_node)) {
+            if (immnd_proc_imma_discard_connection(cb, cl_node, false)) {
                 TRACE_5("Removing client id:%llx sv_id:%u", cl_node->imm_app_hdl, cl_node->sv_id);
                 immnd_client_node_del(cb, cl_node);
                 memset(cl_node, '\0', sizeof(IMMND_IMM_CLIENT_NODE));
@@ -422,6 +436,17 @@ uint32_t immnd_introduceMe(IMMND_CB *cb)
         send_evt.info.immd.info.ctrl_msg.pbeEnabled,
         send_evt.info.immd.info.ctrl_msg.dir.size);

+    if(cb->mIntroduced==2) {
+        LOG_NO("Re-introduce-me highestProcessed:%llu highestReceived:%llu",
+            cb->highestProcessed, cb->highestReceived);
+        send_evt.info.immd.info.ctrl_msg.refresh = 2;
+        send_evt.info.immd.info.ctrl_msg.fevs_count = cb->highestReceived;
+
+        send_evt.info.immd.info.ctrl_msg.admo_id_count = cb->mLatestAdmoId;;
+        send_evt.info.immd.info.ctrl_msg.ccb_id_count = cb->mLatestCcbId;
+        send_evt.info.immd.info.ctrl_msg.impl_count = cb->mLatestImplId;
+    }
+
     if (!immnd_is_immd_up(cb)) {
         return NCSCC_RC_FAILURE;
     }
@@ -480,7 +505,7 @@ static int32_t immnd_iAmLoader(IMMND_CB
         TRACE_5("Loading is not possible, preLoader still attached");
         return (-3);
     }
-
+LOG_IN("ABT CLOUD PROTO cb->mMyEpoch:%u != cb->mRulingEpoch:%u", cb->mMyEpoch, cb->mRulingEpoch);
     if (cb->mMyEpoch != cb->mRulingEpoch) {
         /*We are joining the cluster, need to sync this IMMND. */
         return (-2);
@@ -536,7 +561,7 @@ static uint32_t immnd_requestSync(IMMND_
     uint32_t rc = NCSCC_RC_SUCCESS;
     IMMSV_EVT send_evt;
     memset(&send_evt, '\0', sizeof(IMMSV_EVT));
-
+LOG_NO("ABT REQUESTING SYNC");
     send_evt.type = IMMSV_EVT_TYPE_IMMD;
     send_evt.info.immd.type = IMMD_EVT_ND2D_REQ_SYNC;
     send_evt.info.immd.info.ctrl_msg.ndExecPid = cb->mMyPid;
@@ -546,6 +571,7 @@ static uint32_t immnd_requestSync(IMMND_
     if (immnd_is_immd_up(cb)) {
         rc = immnd_mds_msg_send(cb, NCSMDS_SVC_ID_IMMD, cb->immd_mdest_id, &send_evt);
     } else {
+        LOG_IN("Could not request sync because IMMD is not UP");
         rc = NCSCC_RC_FAILURE;
     }
     return (rc == NCSCC_RC_SUCCESS);
@@ -1571,13 +1597,19 @@ static int immnd_forkPbe(IMMND_CB *cb)
     if (pid == 0) { /*child */
         /* TODO: Should close file-descriptors ... */
         /*char * const pbeArgs[5] = { (char *) execPath, "--recover", "--pbeXX", dbFilePath, 0 };*/
-        char * pbeArgs[5];
+        char * pbeArgs[6];
         bool veteran = (cb->mIsCoord) ?
(cb->mPbeVeteran) : (cb->m2Pbe && cb->mPbeVeteranB);
         pbeArgs[0] = (char *) execPath;
-        if(veteran) {
+        if(veteran && cb->mScAbsenceAllowed && !cb->mPbeUsesSharedFs) {
+            pbeArgs[1] = "--recover";
+            pbeArgs[2] = "--check-objects";
+            pbeArgs[3] = (cb->m2Pbe)?((cb->mIsCoord)?"--pbe2A":"--pbe2B"):"--pbe";
+            pbeArgs[4] = dbFilePath;
+            pbeArgs[5] = 0;
+        } else if(veteran) {
             pbeArgs[1] = "--recover";
             pbeArgs[2] = (cb->m2Pbe)?((cb->mIsCoord)?"--pbe2A":"--pbe2B"):"--pbe";
-            pbeArgs[3] = dbFilePath;
+            pbeArgs[3] = dbFilePath;
             pbeArgs[4] = 0;
         } else {
             pbeArgs[1] = (cb->m2Pbe)?((cb->mIsCoord)?"--pbe2A":"--pbe2B"):"--pbe";
@@ -1685,7 +1717,7 @@ uint32_t immnd_proc_server(uint32_t *tim
                 cb->mJobStart = now;
             }
         } else { /*We are not ready to start loading yet */
-            if(cb->mIntroduced) {
+            if(cb->mIntroduced==1) {
                 if((cb->m2Pbe == 2) && !(cb->preLoadPid)) {
                     cb->preLoadPid = immnd_forkLoader(cb, true);
                 }
@@ -1833,6 +1865,7 @@ uint32_t immnd_proc_server(uint32_t *tim
             cb->mState = IMM_SERVER_READY;
             immnd_ackToNid(NCSCC_RC_SUCCESS);
             LOG_NO("SERVER STATE: IMM_SERVER_LOADING_SERVER --> IMM_SERVER_READY");
+            immModel_setScAbsenceAllowed(cb);
             cb->mJobStart = now;
             if (cb->mPbeFile) {/* Pbe enabled */
                 cb->mRim = immModel_getRepositoryInitMode(cb);
@@ -1876,6 +1909,7 @@ uint32_t immnd_proc_server(uint32_t *tim
             cb->mState = IMM_SERVER_READY;
             cb->mJobStart = now;
             LOG_NO("SERVER STATE: IMM_SERVER_LOADING_CLIENT --> IMM_SERVER_READY");
+            immModel_setScAbsenceAllowed(cb);
             if (cb->mPbeFile) {/* Pbe configured */
                 cb->mRim = immModel_getRepositoryInitMode(cb);

@@ -1896,7 +1930,9 @@ uint32_t immnd_proc_server(uint32_t *tim
             cb->mJobStart = now;
             cb->mState = IMM_SERVER_READY;
             immnd_ackToNid(NCSCC_RC_SUCCESS);
-            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM SERVER READY");
+            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM_SERVER_READY");
+            immModel_setScAbsenceAllowed(cb);
+
             /*
                This code case duplicated in immnd_evt.c
                Search for: "ticket:#599"
@@ -1927,7 +1963,7 @@ uint32_t
immnd_proc_server(uint32_t *tim
             cb->mStep = 0;
             cb->mJobStart = now;
             cb->mState = IMM_SERVER_READY;
-            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM SERVER READY");
+            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY");
         }
         if (!(cb->mStep % 60)) {
             LOG_IN("Sync Phase-1, waiting for existing "
@@ -1944,7 +1980,7 @@ uint32_t immnd_proc_server(uint32_t *tim
             cb->mStep = 0;
             cb->mJobStart = now;
             cb->mState = IMM_SERVER_READY;
-            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM SERVER READY");
+            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY");
         }

         /* PBE may intentionally be restarted by sync. Catch this here. */
@@ -1977,7 +2013,7 @@ uint32_t immnd_proc_server(uint32_t *tim
             cb->mJobStart = now;
             cb->mState = IMM_SERVER_READY;
             immnd_abortSync(cb);
-            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM SERVER READY");
+            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY");
         } else {
             LOG_IN("Sync Phase-2: Ccbs are terminated, IMM in "
                 "read-only mode, forked sync process pid:%u", cb->syncPid);
@@ -1991,7 +2027,7 @@ uint32_t immnd_proc_server(uint32_t *tim
             cb->mStep = 0;
             cb->mJobStart = now;
             cb->mState = IMM_SERVER_READY;
-            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM SERVER READY");
+            LOG_NO("SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY");
         } else if (!(cb->mSyncFinalizing)) {
             int status = 0;
             if (waitpid(cb->syncPid, &status, WNOHANG) > 0) {
@@ -2031,6 +2067,11 @@ uint32_t immnd_proc_server(uint32_t *tim
             }
         }

+        if(cb->mIntroduced == 2) {
+            immnd_introduceMe(cb);
+            break;
+        }
+
         coord = immnd_iAmCoordinator(cb);

         if (cb->pbePid > 0) {
@@ -2275,3 +2316,28 @@ void immnd_dump_client_info(IMMND_IMM_CL
 }

 #endif
+
+/* Only for scAbsenceAllowed (headless hydra) */
+void immnd_proc_discard_other_nodes(IMMND_CB *cb)
+{
+    TRACE_ENTER();
+    /* Discard all clients.
*/
+
+    IMMND_IMM_CLIENT_NODE *cl_node = NULL;
+    immnd_client_node_getnext(cb, 0, &cl_node);
+    while (cl_node) {
+        LOG_NO("Removing client id:%llx sv_id:%u", cl_node->imm_app_hdl, cl_node->sv_id);
+        osafassert(immnd_proc_imma_discard_connection(cb, cl_node, true));
+        LOG_NO("ABT discard_connection OK");
+        osafassert(immnd_client_node_del(cb, cl_node) == NCSCC_RC_SUCCESS);
+        free(cl_node);
+        cl_node = NULL;
+        LOG_NO("ABT Client node REMOVED");
+        immnd_client_node_getnext(cb, 0, &cl_node);
+    }
+
+    LOG_NO("ABT DONE REMOVING CLIENTS ENTERING immModel_isolateThisNode(cb) ");
+    immModel_isolateThisNode(cb);
+    immModel_abortNonCriticalCcbs(cb);
+    TRACE_LEAVE();
+}
------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel