[
https://issues.apache.org/jira/browse/HDDS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-15443:
----------------------------------
Labels: pull-request-available (was: )
> close statemachine immediately on write failure
> -----------------------------------------------
>
> Key: HDDS-15443
> URL: https://issues.apache.org/jira/browse/HDDS-15443
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Datanode
> Reporter: Sumit Agrawal
> Assignee: Sumit Agrawal
> Priority: Major
> Labels: pull-request-available
>
> When the leader performs write() and it fails, the Ratis server does not
> respond immediately: it waits for a re-election so that another server can
> process the request in quorum. But since the leader is still present, the
> re-election either does not happen or succeeds only by chance.
>
> Since the server returns no reply, the client hangs until a timeout
> occurs or SCM closes the pipeline on this error.
>
> The state machine is no longer usable, since no other request is allowed
> to be processed. It is therefore better to close it, which gives the
> behavior below.
> If the leader's write() fails and its state machine closes:
> * the leader replies with ServerNotReadyException immediately
> * the client retries as per policy, until either a new leader is elected
> or the raft group is removed
> * leader election happens within a few seconds once the leader is closed
> * once a new leader is chosen and the client retries, the request returns
> success with majority commit
>
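The leader-side behavior described above can be sketched roughly as follows. This is a minimal illustration, not the Ozone or Ratis implementation: `NotReadyException` is a local stand-in for Ratis' ServerNotReadyException, and `ChunkWriter`, `FailFastStateMachine` and `FailFastDemo` are hypothetical names introduced only for this example. The point it shows is that the first failed write closes the state machine, and every later request is rejected immediately instead of leaving the client waiting.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for Ratis' ServerNotReadyException.
class NotReadyException extends RuntimeException {
    NotReadyException(String msg) { super(msg); }
}

// Hypothetical storage hook; a real datanode would write chunk data here.
interface ChunkWriter {
    void write(byte[] data) throws IOException;
}

// Sketch of a state machine that closes itself on the first write failure.
class FailFastStateMachine {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final ChunkWriter writer;

    FailFastStateMachine(ChunkWriter writer) { this.writer = writer; }

    boolean isClosed() { return closed.get(); }

    void write(byte[] data) {
        // Once closed, reject immediately so the caller can retry elsewhere.
        if (closed.get()) {
            throw new NotReadyException("state machine is closed");
        }
        try {
            writer.write(data);
        } catch (IOException e) {
            // Close on the first failure instead of staying half-usable.
            closed.set(true);
            throw new NotReadyException("write failed, state machine closed: " + e.getMessage());
        }
    }
}

class FailFastDemo {
    public static void main(String[] args) {
        // A writer that always fails, simulating a disk error on the leader.
        FailFastStateMachine sm =
            new FailFastStateMachine(d -> { throw new IOException("disk error"); });
        for (int i = 0; i < 2; i++) {
            try {
                sm.write(new byte[] {1});
            } catch (NotReadyException e) {
                System.out.println("attempt " + i + ": " + e.getMessage());
            }
        }
    }
}
```

In this sketch the client-side retry policy is left out; the description only states that the client retries until a new leader is chosen or the raft group is removed.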
> If one follower's write() fails and its state machine closes, the leader
> still processes client requests and commits with a majority of nodes.
>
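The majority argument can be checked with a few lines of arithmetic. The sketch below assumes a three-node raft group, which the description implies but does not state outright; `QuorumCheck`, `majority` and `canCommit` are hypothetical names used only for illustration.

```java
// Majority-quorum arithmetic for a raft group (illustrative sketch).
class QuorumCheck {
    // Smallest number of nodes that forms a majority of groupSize.
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    // True if the nodes still running can reach a majority commit
    // after failedNodes state machines have closed.
    static boolean canCommit(int groupSize, int failedNodes) {
        return groupSize - failedNodes >= majority(groupSize);
    }

    public static void main(String[] args) {
        System.out.println(canCommit(3, 1)); // prints "true": 2 of 3 nodes remain
        System.out.println(canCommit(3, 2)); // prints "false": only 1 of 3 remains
    }
}
```

With one closed follower the remaining two nodes still form a majority, while the 2-node failure case leaves a single node and cannot commit.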
> On failure of any node, SCM:
> * closes containers with a cool-down time (2.5 minutes by default)
> * stops allocating new blocks
> * closes the pipeline after 5 minutes
> This ensures that in-progress writes, if any, can finish with 2 nodes
> running.
>
> Impact:
> * Graceful shutdown to finish apply transaction is not handled. Impact:
> ** If the leader closes, it returns a failure to the client waiting for a
> reply, and the client can retry
> ** If one follower closes, a majority of nodes is still present to process
> requests, and the container closes before the pipeline closes
> ** 2-node follower failure: only one node has the data, as expected.
>
> The issues below are to be handled in a separate JIRA:
> # 2-node failure case
> # client configuration for long waits for commit-all / majority-commit,
> and other config
>
>