[ 
https://issues.apache.org/jira/browse/HDDS-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886264#comment-17886264
 ] 

Tsz-wo Sze edited comment on HDDS-10798 at 10/2/24 11:24 PM:
-------------------------------------------------------------

In a cluster that has HDDS-10546 but not HDDS-10798, the OM leader can remain 
not ready indefinitely. The details are as follows:
 - Step 0: OM startup
{code:java}
ratisApplied:  9  // the applied index in Ratis
ozoneApplied:  9  // the applied index in Ozone
lastSkipped : -1  // the last index skipped in notifyTermIndexUpdated(..)
lastNotified: -1  // the last index passed to notifyTermIndexUpdated(..)
{code}

 - Step 1: applyTransaction(logEntryIndex=10)
{code:java}
double buffer: 10
ozoneApplied : 9       // unchanged: double buffer is not yet flushed
ratisApplied : 9 -> 10 // see Note 1 below
{code}

 ** Note 1: ratisApplied and ozoneApplied can differ because of the OM double 
buffer. OzoneManagerStateMachine.applyTransaction(..) stores the 
transaction in the double buffer (instead of applying it) and then returns 
completion to Ratis. As a result, ratisApplied is incremented immediately, but 
ozoneApplied is not incremented until the double buffer is flushed.
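
The decoupling described in Note 1 can be sketched with a small model. This is only an illustration, not the actual OzoneManagerStateMachine code: the class DoubleBufferSketch, its fields, and the flush() method are hypothetical names chosen to mirror the counters used in the steps.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical model of Note 1; not the real OzoneManagerStateMachine. */
final class DoubleBufferSketch {
  private final Queue<Long> doubleBuffer = new ArrayDeque<>();
  private long ratisApplied;
  private long ozoneApplied;

  DoubleBufferSketch(long startIndex) {
    this.ratisApplied = startIndex;
    this.ozoneApplied = startIndex;
  }

  /** Queues the entry and reports completion at once, so only ratisApplied moves. */
  void applyTransaction(long logEntryIndex) {
    doubleBuffer.add(logEntryIndex);
    ratisApplied = logEntryIndex;
  }

  /** Drains the buffer; only at this point does ozoneApplied catch up. */
  void flush() {
    Long index;
    while ((index = doubleBuffer.poll()) != null) {
      ozoneApplied = index;
    }
  }

  long getRatisApplied() {
    return ratisApplied;
  }

  long getOzoneApplied() {
    return ozoneApplied;
  }

  public static void main(String[] args) {
    DoubleBufferSketch om = new DoubleBufferSketch(9);
    om.applyTransaction(10);
    // Before the flush: ratisApplied=10, ozoneApplied=9
    System.out.println(om.getRatisApplied() + " " + om.getOzoneApplied());
    om.flush();
    // After the flush: ratisApplied=10, ozoneApplied=10
    System.out.println(om.getRatisApplied() + " " + om.getOzoneApplied());
  }
}
```

Between applyTransaction(10) and flush(), the two counters differ by one, which is the window in which Steps 2 and 3 take place.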

 - Step 2: OM becomes the Leader and writes STARTUP_ENTRY with index=11
{code:java}
Ratis apply: STARTUP_ENTRY with index=11 // see Note 2 below
  ratisApplied : 10 -> 11
notifyTermIndexUpdated(newIndex=11)
  lastSkipped  : -1 -> 10 // see Note 3 below
  lastNotified : -1 -> 11
updateLastAppliedTermIndex? no since lastNotified(11) - ozoneApplied(9) != 1
{code}

 ** Note 2: The type of STARTUP_ENTRY is CONF_ENTRY_TYPE.
 ** Note 3: Only the indices of non-state-machine log entries are passed to 
notifyTermIndexUpdated(..). Suppose the indices passed to 
notifyTermIndexUpdated(..) are as follows, where () marks an index that is not 
passed:
{code:java}
204, (), 206, (), (), 209, 210.
{code}
Then, lastSkipped will be 208 and lastNotified will be 210.
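
The bookkeeping in Note 3 can be reproduced with a short sketch. The update rule below is inferred from the traces in Steps 2 and 3 (a gap larger than 1 between newIndex and lastNotified sets lastSkipped to newIndex - 1); it is an assumption for illustration, and SkippedIndexSketch is a hypothetical class, not the Ozone implementation.

```java
/** Hypothetical model of the lastSkipped/lastNotified bookkeeping in Note 3. */
final class SkippedIndexSketch {
  private long lastSkipped = -1;
  private long lastNotified = -1;

  /** Assumed rule: a gap before newIndex means the index just below it was skipped. */
  void notifyTermIndexUpdated(long newIndex) {
    if (newIndex - lastNotified > 1) {
      lastSkipped = newIndex - 1;
    }
    lastNotified = newIndex;
  }

  long getLastSkipped() {
    return lastSkipped;
  }

  long getLastNotified() {
    return lastNotified;
  }

  public static void main(String[] args) {
    SkippedIndexSketch s = new SkippedIndexSketch();
    // Indices 205, 207 and 208 are never passed, as in the example of Note 3.
    for (long index : new long[] {204, 206, 209, 210}) {
      s.notifyTermIndexUpdated(index);
    }
    // prints "lastSkipped=208 lastNotified=210"
    System.out.println("lastSkipped=" + s.getLastSkipped()
        + " lastNotified=" + s.getLastNotified());
  }
}
```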

 - Step 3: (optional)
{code:java}
Ratis apply: META_ENTRY with index=12
  ratisApplied : 11 -> 12
notifyTermIndexUpdated(newIndex=12)
  lastSkipped  : 10 [unchanged: newIndex(12) - lastNotified(11) = 1]
  lastNotified : 11 -> 12
updateLastAppliedTermIndex? no since lastNotified(12) - ozoneApplied(9) != 1
{code}

 - Step 4: BUG
{code:java}
Double buffer flush: 10
updateLastAppliedIndex(newTermIndex=10)
  C1: newTermIndex(10) < lastNotified(12) is true
  C2: ozoneApplied(9) >= lastSkipped(10) is false
    newTermIndex: 10 (unchanged)
  ozoneApplied : 9 -> 10
  The Leader remains not ready since ozoneApplied(10) < STARTUP_ENTRY(11)
{code}

 - Step 4': FIX, in C2, use newTermIndex instead of ozoneApplied
{code:java}
Double buffer flush: 10
updateLastAppliedIndex(newTermIndex=10)
  C1: newTermIndex(10) < lastNotified(12) is true
  C2: newTermIndex(10) >= lastSkipped(10) is true
    newTermIndex: 10 -> 12
  ozoneApplied : 9 -> 12
  The Leader becomes ready since ozoneApplied(12) >= STARTUP_ENTRY(11)
{code}
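
Steps 0 through 4' can be replayed end to end with a small simulation. This is a sketch under stated assumptions, not the actual Ozone code: LeaderReadySketch, the useFix flag, isLeaderReady(..) and replay(..) are hypothetical, and the conditions are transcribed from the traces above (C1, C2, and the "difference equals 1" check in notifyTermIndexUpdated).

```java
/** Hypothetical replay of Steps 0 to 4'; the conditions are taken from the traces. */
final class LeaderReadySketch {
  private final boolean useFix;
  private long ozoneApplied = 9;   // Step 0
  private long lastSkipped = -1;   // Step 0
  private long lastNotified = -1;  // Step 0

  LeaderReadySketch(boolean useFix) {
    this.useFix = useFix;
  }

  /** Called for non-state-machine entries such as STARTUP_ENTRY and META_ENTRY. */
  void notifyTermIndexUpdated(long newIndex) {
    if (newIndex - lastNotified > 1) {
      lastSkipped = newIndex - 1;
    }
    lastNotified = newIndex;
    // Advance ozoneApplied only when it is directly behind lastNotified.
    if (lastNotified - ozoneApplied == 1) {
      ozoneApplied = lastNotified;
    }
  }

  /** Called when the double buffer flushes the transaction at newTermIndex. */
  void updateLastAppliedIndex(long newTermIndex) {
    // C2 operand: the stale ozoneApplied (bug) or the flushed index (fix).
    final long c2Operand = useFix ? newTermIndex : ozoneApplied;
    if (newTermIndex < lastNotified && c2Operand >= lastSkipped) {
      newTermIndex = lastNotified;
    }
    ozoneApplied = newTermIndex;
  }

  long getOzoneApplied() {
    return ozoneApplied;
  }

  boolean isLeaderReady(long startupEntryIndex) {
    return ozoneApplied >= startupEntryIndex;
  }

  /** Replays Step 2 (index 11), Step 3 (index 12) and the flush of index 10. */
  static LeaderReadySketch replay(boolean useFix) {
    LeaderReadySketch om = new LeaderReadySketch(useFix);
    om.notifyTermIndexUpdated(11);
    om.notifyTermIndexUpdated(12);
    om.updateLastAppliedIndex(10);
    return om;
  }

  public static void main(String[] args) {
    LeaderReadySketch bug = replay(false);
    LeaderReadySketch fix = replay(true);
    // prints "bug: ozoneApplied=10 ready=false"
    System.out.println("bug: ozoneApplied=" + bug.getOzoneApplied()
        + " ready=" + bug.isLeaderReady(11));
    // prints "fix: ozoneApplied=12 ready=true"
    System.out.println("fix: ozoneApplied=" + fix.getOzoneApplied()
        + " ready=" + fix.isLeaderReady(11));
  }
}
```

With the stale ozoneApplied as the C2 operand, the flush of index 10 leaves ozoneApplied below STARTUP_ENTRY, and nothing else in this model advances it. With newTermIndex as the operand, the same flush jumps ozoneApplied to lastNotified.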



> OMLeaderNotReadyException exception on switch leader
> ----------------------------------------------------
>
>                 Key: HDDS-10798
>                 URL: https://issues.apache.org/jira/browse/HDDS-10798
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>
> The client receives a LeaderNotReady exception:
> {code:java}
> 2024-05-02 13:54:07,941 DEBUG [IPC Server handler 70 on 
> 9862]-org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB:
>  om72 is Leader but not ready to process request yet.{code}
>  
> The fix for HDDS-10546 misses one scenario:
>  * notifyTermIndexUpdate sets lastSkippedIndex while a few transactions are 
> still in the double buffer
>  * the double buffer's index-update notification does not update 
> lastNotifiedTermIndex because the check 
> 'lastApplied.getIndex() >= lastSkippedIndex' fails, since lastApplied holds a 
> much older value
> The issue occurs randomly: when an election happens while transactions are 
> still in the double buffer, the notified transaction ID may not be updated. 
> The OM recovers after a restart.


