The root cause is a bug in the "backpressure fix" I created about a
year ago to reduce the problems with the PBE being chronically
clogged up by PRTO updates, with ccbs typically backlogged on top.
That backpressure mechanism rejects a CCB apply with TRY_AGAIN if
there are CCBs (in critical state, waiting only for the PBE) that
have been stuck for a >>time<< longer than a threshold.
The problem is that this criterion, the waiting time, is not a
criterion whose evaluation is guaranteed to be exactly consistent
across the cluster.
The resolution used is seconds, but the rounding to seconds done by
time(2) obviously may flip one way or the other at the different nodes.
This may result in the apply request being rejected with TRY_AGAIN at
some nodes yet queued up to wait for the PBE decision at others.
So what happens is that a ccb-apply is rejected at SC-1 with
TRY_AGAIN but accepted at PL-8 (because the fevs message arrived
earlier in real time there, so the waited time had not yet passed
the 6 second threshold at PL-8, while it had at the other nodes).
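To make the non-determinism concrete, here is a minimal sketch (not the
actual immnd code; the struct, function and constant names are invented
for illustration) of the kind of per-node check that a second-resolution
time(2) based threshold implies:
--------------------------------------------------------
#include <ctime>

/* Hypothetical bookkeeping for a ccb stuck in critical, waiting on the PBE. */
struct CcbInfo {
    time_t enteredCriticalAt;   /* recorded with time(NULL) when the ccb went critical */
};

const time_t kCriticalThresholdSecs = 6;  /* assumed threshold, matching the '6' in the logs */

/* Each node evaluates this independently when a new ccb-apply arrives over
 * fevs. time(NULL) truncates to whole seconds and the fevs message reaches
 * the nodes at slightly different real times, so near the boundary the same
 * apply can be over the threshold on one node (=> TRY_AGAIN) and under it
 * on another (=> queued to wait for the PBE decision).
 */
bool shouldRejectApplyWithTryAgain(const CcbInfo& stuckCcb)
{
    time_t now = time(nullptr);               /* whole-second resolution */
    time_t waited = now - stuckCcb.enteredCriticalAt;
    return waited >= kCriticalThresholdSecs;  /* can flip either way at different nodes */
}
--------------------------------------------------------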
The problem was introduced in changeset:
changeset: 3649:8caba07ad3b8
branch: opensaf-4.2.x
user: Anders Bjornerstedt <[email protected]>
date: Fri Jul 20 12:46:42 2012 +0200
summary: immsv: Throttle writes to PBE (#2718)
---
** [tickets:#535] IMM: Retry on ccb-apply can cause divergence on commit/abort decision between IMMNDs.**
**Status:** assigned
**Created:** Thu Aug 08, 2013 09:23 AM UTC by Anders Bjornerstedt
**Last Updated:** Thu Aug 08, 2013 09:23 AM UTC
**Owner:** Anders Bjornerstedt
The symptom is that a CCB is committed in the cluster:
On SC-1:
Jul 4 12:26:30 SC-1 osafimmnd[12180]: NO Ccb 63 COMMITTED (OAMSA19)
Same log message appears on all nodes except on PL-8:
Jul 4 12:26:30 PL-8 osafimmnd[17948]: NO ERR_FAILED_OPERATION: Ccb not in correct state (7) for Apply
Jul 4 12:26:30 PL-8 osafimmnd[17948]: WA CCB 63 is in critical state, can not abort
Jul 4 12:26:30 PL-8 osafimmnd[17948]: WA Will not terminate ccb 63 in critical state
Jul 4 12:26:30 PL-8 osafimmnd[17948]: NO Ccb 61 COMMITTED (immcfg_SC-1_17144)
Jul 4 12:26:30 PL-8 osafimmnd[17948]: NO Ccb 63 ABORTED (OAMSA19)
As can be seen, the CCB is also being processed for commit at PL-8 when
apparently a spurious, redundant apply request arrives for the same ccb-id.
This has the catastrophic effect of aborting the ccb locally at PL-8.
-----------------------------------------------------
An important clue is the sign of PBE backlog and a series of retries due to
TRY_AGAIN on saImmOmCcbApply at SC-1. Note that the printed number '6' in the
backlog message is the timeout in seconds set for that ccb, whereas the '6'
printed in the error message for CcbApply is the error code TRY_AGAIN.
I was initially confused by this and thought the ccb-apply had received
ERR_TIMEOUT, but that is not the case here.
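For reference, this is the kind of TRY_AGAIN retry loop an OM client (such as
the COM_SA agent seen in the logs) typically runs around saImmOmCcbApply. The
loop itself is illustrative, not COM_SA's actual code, but the API call is the
real SAF IMM OM API and SA_AIS_ERR_TRY_AGAIN has the numeric value 6 (while
ERR_TIMEOUT is 5). Each retry results in a fresh apply request towards the
IMMNDs, and it is such a retried apply that reaches PL-8 while ccb 63 is
already past the point of no return there:
--------------------------------------------------------
#include <saImmOm.h>
#include <unistd.h>

/* Illustrative retry wrapper around saImmOmCcbApply. The retry count and
 * back-off are made up; the error-code values come from saAis.h.
 */
SaAisErrorT applyWithRetry(SaImmCcbHandleT ccbHandle)
{
    SaAisErrorT rc;
    int retries = 0;
    do {
        rc = saImmOmCcbApply(ccbHandle);   /* each call sends a new apply request */
        if (rc != SA_AIS_ERR_TRY_AGAIN)    /* SA_AIS_ERR_TRY_AGAIN == 6 */
            break;
        usleep(500 * 1000);                /* back off before retrying */
    } while (++retries < 10);
    return rc;
}
--------------------------------------------------------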
--------------------------------------------------------
Jul 4 12:26:28 SC-1 osafimmnd[12180]: NO Ccb 49 COMMITTED (OAMSA8)
Jul 4 12:26:28 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul 4 12:26:28 SC-1 osafimmnd[12180]: WA Timeout (6) on transaction in critical state! ccb:58
Jul 4 12:26:28 SC-1 osafimmnd[12180]: WA PBE implementer 18 seems hung!
Jul 4 12:26:28 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul 4 12:26:29 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul 4 12:26:29 SC-1 osafimmnd[12180]: WA Timeout (6) on transaction in critical state! ccb:58
Jul 4 12:26:29 SC-1 osafimmnd[12180]: WA Timeout (6) on transaction in critical state! ccb:59
Jul 4 12:26:29 SC-1 osafimmnd[12180]: WA PBE implementer 18 seems hung!
Jul 4 12:26:29 SC-1 osafimmnd[12180]: WA PBE implementer 18 seems hung!
Jul 4 12:26:29 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
Jul 4 12:26:29 SC-1 osafimmnd[12180]: NO Ccb 58 COMMITTED (OAMSA17)
Jul 4 12:26:29 SC-1 com: COM_SA ImmCmdOmCcbApply::ImmCmdOmCcbApply fail err = 6
------------------------------------------------------
The discrepancy is detected a few minutes later in a sync verify, where veteran
nodes (nodes that are already up in the cluster) use any sync not to sync but
to verify that their imm state matches the coord state. In this case PL-8
detects that it has a different transaction outcome and asserts in the immnd,
followed by a restart and sync of that immnd.
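A simplified sketch of that verification step follows (the real code is in
finalizeSync in ImmModel.cc; the types, containers and numeric enum values
below are assumptions chosen only to line up with the 9/10 in the log, not
the actual definitions):
--------------------------------------------------------
#include <cassert>
#include <cstdio>
#include <map>

/* Assumed values: 9 = committed, 10 = aborted, matching the log below. */
enum ImmCcbState { IMM_CCB_COMMITTED = 9, IMM_CCB_ABORTED = 10 };

struct CcbOutcome { unsigned int ccbId; int ccbState; };  /* outcome reported by the coord */
struct LocalCcb   { ImmCcbState mState; };                /* outcome recorded locally */

void verifyCcbOutcomes(const std::map<unsigned int, LocalCcb>& localCcbs,
                       const CcbOutcome* ol, size_t nOutcomes)
{
    for (size_t i = 0; i < nOutcomes; ++i) {
        std::map<unsigned int, LocalCcb>::const_iterator it = localCcbs.find(ol[i].ccbId);
        if (it == localCcbs.end())
            continue;
        const LocalCcb* ccb = &it->second;
        if (ccb->mState != (ImmCcbState)ol[i].ccbState)
            fprintf(stderr, "ER ccb->mState:%d != ol->ccbState:%d for CCB:%u\n",
                    ccb->mState, ol[i].ccbState, ol[i].ccbId);
        /* A divergent outcome is fatal for this immnd: assert, core dump, restart. */
        assert(ccb->mState == (ImmCcbState)ol[i].ccbState);
    }
}
--------------------------------------------------------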
--------------------------------------------------------------
Jul 4 12:28:27 PL-8 osafimmnd[17948]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Jul 4 12:28:33 PL-8 osafimmnd[17948]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 14277
Jul 4 12:28:33 PL-8 osafimmnd[17948]: ER ccb->mState:10 != ol->ccbState:9 for CCB:63
Jul 4 12:28:33 PL-8 osafimmnd[17948]: ImmModel.cc:14516: finalizeSync:
Assertion 'ccb->mState == (ImmCcbState) ol->ccbState' failed.
Jul 4 12:28:33 PL-8 osafamfnd[17993]: NO 'safComp=IMMND,safSu=PL-8,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
---------------------------------------------------------------
But this "corrective action" is of course not something to be relied upon
since (a) The node PL-8 is inconsistent for minutes; (b) Ccb-outcomes
older than 5 minutes are garbage collected in the immnd; and (c) the
node being inconsistent could be the immnd-coord, forcing all other veterans
to conform to the inconsistent result.
But the sync verify did its job in the sense that it caught the problem
and generated a core-dump which generated a trouble report..