- **status**: review --> fixed


---

**[tickets:#535] IMM: Retry on ccb-apply can cause divergence on commit/abort decision between IMMNDs.**

**Status:** fixed
**Created:** Thu Aug 08, 2013 09:23 AM UTC by Anders Bjornerstedt
**Last Updated:** Mon Aug 12, 2013 10:57 AM UTC
**Owner:** Anders Bjornerstedt

The symptom is that a CCB is committed in the cluster:

On SC-1:
Jul  4 12:26:30 SC-1 osafimmnd[12180]: NO Ccb 63 COMMITTED (OAMSA19)

The same log message appears on all nodes except PL-8:

Jul  4 12:26:30 PL-8 osafimmnd[17948]: NO ERR_FAILED_OPERATION: Ccb not in correct state (7) for Apply
Jul  4 12:26:30 PL-8 osafimmnd[17948]: WA CCB 63 is in critical state, can not abort
Jul  4 12:26:30 PL-8 osafimmnd[17948]: WA Will not terminate ccb 63 in critical state
Jul  4 12:26:30 PL-8 osafimmnd[17948]: NO Ccb 61 COMMITTED (immcfg_SC-1_17144)
Jul  4 12:26:30 PL-8 osafimmnd[17948]: NO Ccb 63 ABORTED (OAMSA19)

As can be seen, the CCB is being processed for commit at PL-8 as well when,
apparently, a spurious redundant apply request arrives for the same ccb-id.
This has the catastrophic effect of aborting the ccb locally at PL-8.
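
For illustration, here is a minimal sketch of the kind of idempotency guard the
apply path needs. This is not the actual ImmModel.cc code; the names and the
state codes (7 = critical, 9 = committed, 10 = aborted) are assumptions read
off the log lines in this report:

```cpp
#include <saAis.h>

// Illustrative state codes, inferred from the log above; not the real enum.
enum ImmCcbState {
    IMM_CCB_CRITICAL  = 7,
    IMM_CCB_COMMITTED = 9,
    IMM_CCB_ABORTED   = 10
};

struct CcbInfo {
    unsigned int mId;
    ImmCcbState  mState;
};

// Hypothetical guard run when an apply request for 'ccb' reaches an IMMND.
SaAisErrorT ccbApplyGuard(CcbInfo* ccb)
{
    if (ccb->mState == IMM_CCB_CRITICAL) {
        // A retried/duplicate apply has raced with a commit already in
        // progress. The safe answer is TRY_AGAIN, leaving the ccb alone.
        // The behaviour seen above instead failed the apply and aborted
        // the ccb locally, diverging from the nodes that commit it.
        return SA_AIS_ERR_TRY_AGAIN;
    }
    return SA_AIS_OK; /* normal apply processing would continue here */
}
```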
-----------------------------------------------------

An important clue is the sign of a PBE backlog and a series of retries due to
TRY_AGAIN on saImmOmCcbApply at SC-1. Note that the printed number '6' in the
backlog message is the timeout in seconds set for that ccb, whereas the printed
'6' in the error message from CcbApply is the error code TRY_AGAIN.
I was initially confused by this and thought the ccb-apply had received
ERR_TIMEOUT, but that is not the case here.
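
For illustration, a retry loop of the kind the OM-side agent runs.
saImmOmCcbApply() and the SaAisErrorT codes are from the SAF IMM API; the loop
body itself is an assumption, not COM_SA's actual code:

```cpp
#include <saAis.h>
#include <saImmOm.h>
#include <unistd.h>

SaAisErrorT applyWithRetry(SaImmCcbHandleT ccbHandle)
{
    SaAisErrorT rc;
    do {
        rc = saImmOmCcbApply(ccbHandle);
        // SA_AIS_ERR_TRY_AGAIN is 6 and SA_AIS_ERR_TIMEOUT is 5, so the
        // 'fail err = 6' in the COM_SA lines below is TRY_AGAIN, not the
        // 6-second ccb timeout printed by the immnd backlog warning.
        if (rc == SA_AIS_ERR_TRY_AGAIN)
            usleep(500 * 1000); /* back off before retrying */
    } while (rc == SA_AIS_ERR_TRY_AGAIN);
    return rc;
}
```

It is such a retried apply, re-entering the system while the ccb is already in
critical state, that can produce the redundant apply request seen at PL-8.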
--------------------------------------------------------

Jul  4 12:26:28 SC-1 osafimmnd[12180]: NO Ccb 49 COMMITTED (OAMSA8)
Jul  4 12:26:28 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul  4 12:26:28 SC-1 osafimmnd[12180]: WA Timeout (6) on transaction in critical state! ccb:58
Jul  4 12:26:28 SC-1 osafimmnd[12180]: WA PBE implementer 18 seems hung!
Jul  4 12:26:28 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul  4 12:26:29 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul  4 12:26:29 SC-1 osafimmnd[12180]: WA Timeout (6) on transaction in critical state! ccb:58
Jul  4 12:26:29 SC-1 osafimmnd[12180]: WA Timeout (6) on transaction in critical state! ccb:59
Jul  4 12:26:29 SC-1 osafimmnd[12180]: WA PBE implementer 18 seems hung!
Jul  4 12:26:29 SC-1 osafimmnd[12180]: WA PBE implementer 18 seems hung!
Jul  4 12:26:29 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul  4 12:26:29 SC-1 osafimmnd[12180]: NO Ccb 58 COMMITTED (OAMSA17)
Jul  4 12:26:29 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
------------------------------------------------------
The discrepancy is detected a few minutes later in a sync verify, where veteran
nodes (nodes that are already up in the cluster) use any sync not to sync but
to verify that their imm state matches the coord state. In this case PL-8
detects that it has a different transaction outcome and asserts in the immnd,
followed by a restart and sync of that immnd.
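
For illustration, a simplified sketch of that verification; the types are
invented for the example, but the comparison and assertion match the ones
visible in the log below (finalizeSync in ImmModel.cc):

```cpp
#include <cassert>
#include <syslog.h>

struct CcbOutcome  { unsigned int id; int mState;   }; // local record
struct SyncCcbInfo { unsigned int id; int ccbState; }; // shipped by the coord

void verifyCcbOutcome(const CcbOutcome* ccb, const SyncCcbInfo* ol)
{
    if (ccb->mState != ol->ccbState) {
        // Mismatch => this node decided commit/abort differently from the
        // coord. The immnd logs the discrepancy and then asserts, so that
        // AMF restarts the component and a fresh sync repairs its state.
        syslog(LOG_ERR, "ccb->mState:%d != ol->ccbState:%d for CCB:%u",
               ccb->mState, ol->ccbState, ccb->id);
    }
    assert(ccb->mState == ol->ccbState);
}
```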
--------------------------------------------------------------
Jul  4 12:28:27 PL-8 osafimmnd[17948]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Jul  4 12:28:33 PL-8 osafimmnd[17948]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 14277
Jul  4 12:28:33 PL-8 osafimmnd[17948]: ER ccb->mState:10  !=  ol->ccbState:9 for CCB:63
Jul  4 12:28:33 PL-8 osafimmnd[17948]: ImmModel.cc:14516: finalizeSync: Assertion 'ccb->mState == (ImmCcbState) ol->ccbState' failed.
Jul  4 12:28:33 PL-8 osafamfnd[17993]: NO 'safComp=IMMND,safSu=PL-8,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
---------------------------------------------------------------
But this "corrective action" is of course not something to be relied upon
since (a) The  node PL-8 is inconsistent for minutes; (b) Ccb-outcomes 
older than 5 minutes are garbage collected in the immnd; and (c) the 
node being inconsistent could be the immnd-coord, forcing all other veterans
to conform to the inconsistent result.

But the sync verify did its job in the sense that it caught the problem
and generated a core-dump which generated a trouble  report..



