Sumit Agrawal created HDDS-15443:
------------------------------------
Summary: close statemachine immediately on write failure
Key: HDDS-15443
URL: https://issues.apache.org/jira/browse/HDDS-15443
Project: Apache Ozone
Issue Type: Sub-task
Components: Ozone Datanode
Reporter: Sumit Agrawal
Assignee: Sumit Agrawal
When the leader performs write() and it fails, the Ratis server does not respond
immediately. Instead it waits for a re-election, after which the other servers can
process this request in quorum. But since the leader is still present, a
re-election does not happen, or succeeds only by chance.
Because no reply is returned by the server, the client hangs until a timeout
occurs or until SCM closes the pipeline on this error.
At this point the state machine is unusable, as no other request is allowed to be
processed. It is better to close it immediately, which gives the following behavior:
If the leader's write() fails and its state machine closes:
* the leader immediately replies with ServerNotReadyException
* the client retries as per its retry policy, until either a new leader is
available or the Raft group is removed
* once the leader is closed, leader election happens within a few seconds
* once a new leader is chosen and the client retries, the write returns success
with a majority commit
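To make the leader-failure sequence above concrete, here is a minimal Python simulation of it. It is a sketch under stated assumptions, not Ratis or Ozone code: `Node`, `Group`, `client_write` and the fixed retry count are hypothetical, `ServerNotReadyException` is a local stand-in class, and re-election is modeled as simply picking the first node whose state machine is still open.

```python
"""Hypothetical simulation of the proposed behavior (not Ratis/Ozone code)."""


class ServerNotReadyException(Exception):
    """Local stand-in for the exception a closed leader replies with."""


class Node:
    def __init__(self, name, fail_write=False):
        self.name = name
        self.fail_write = fail_write
        self.closed = False  # state machine closed after a write failure
        self.log = []

    def write(self, entry):
        if self.closed:
            raise ServerNotReadyException(self.name)
        if self.fail_write:
            # Proposed change: close the state machine right away instead of
            # leaving the request pending until a re-election happens.
            self.closed = True
            raise ServerNotReadyException(self.name)
        self.log.append(entry)


class Group:
    def __init__(self, nodes):
        self.nodes = nodes
        self.leader = nodes[0]

    def elect_leader(self):
        # Simplified model of re-election: first node still open wins.
        self.leader = next(n for n in self.nodes if not n.closed)

    def submit(self, entry):
        self.leader.write(entry)  # a failing leader raises immediately
        acks = 1
        for node in self.nodes:
            if node is self.leader:
                continue
            try:
                node.write(entry)
                acks += 1
            except ServerNotReadyException:
                pass  # a closed follower does not block the others
        return acks > len(self.nodes) // 2  # majority commit


def client_write(group, entry, max_retries=3):
    """Retry with a hypothetical fixed-count policy; return the attempt number."""
    for attempt in range(1, max_retries + 1):
        try:
            if group.submit(entry):
                return attempt
        except ServerNotReadyException:
            group.elect_leader()  # new leader chosen within a few seconds
    raise TimeoutError("write not committed")


group = Group([Node("dn1", fail_write=True), Node("dn2"), Node("dn3")])
print(client_write(group, "chunk-1"), group.leader.name)
```

With a 3-node group whose leader's write() fails, the first attempt fails fast instead of hanging, a new leader is picked, and the retry commits on the two remaining nodes.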
If one follower's write() fails and its state machine closes, the leader still
processes client requests, committing them once a majority of nodes succeed.
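The follower case reduces to the usual Raft majority rule. A tiny sketch, assuming the leader itself is healthy and every follower whose state machine is still open acknowledges the write; `can_commit` is a hypothetical helper, not an Ozone or Ratis API.

```python
def can_commit(group_size: int, closed_followers: int) -> bool:
    """Return True if the leader can still reach a majority commit.

    Hypothetical helper: assumes the leader is healthy and every follower
    whose state machine is still open acknowledges the write.
    """
    acks = group_size - closed_followers  # leader + open followers
    return acks > group_size // 2


# 3-node pipeline: one closed follower still leaves a majority of 2 nodes.
print(can_commit(3, 1))
```

The same rule shows why losing two followers in a 3-node pipeline leaves the leader unable to commit on its own.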
On failure of any node, SCM will:
* close containers after a cool-down time (2.5 minutes by default)
* stop allocating new blocks
* close the pipeline after 5 minutes
This ensures that any in-progress write can finish on the 2 remaining nodes.
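A rough model of the SCM timeline above, as a sketch only. It assumes (not stated in the JIRA) that all delays are measured from the moment SCM detects the failure, that block allocation stops immediately, and that an in-progress write must finish before its container is closed; `scm_actions` and `write_can_finish` are hypothetical names, not SCM APIs.

```python
# Hypothetical model of the SCM timeline; all times are in seconds and are
# assumed to be measured from the moment SCM detects the node failure.
CONTAINER_CLOSE_COOL_DOWN_S = 150  # 2.5 minutes (default per this JIRA)
PIPELINE_CLOSE_DELAY_S = 300       # 5 minutes


def scm_actions(failure_time_s):
    """Return (time, action) pairs in the order SCM would perform them."""
    return sorted([
        (failure_time_s, "stop allocating new blocks"),
        (failure_time_s + CONTAINER_CLOSE_COOL_DOWN_S, "close containers"),
        (failure_time_s + PIPELINE_CLOSE_DELAY_S, "close pipeline"),
    ])


def write_can_finish(failure_time_s, write_end_s):
    """Assumed rule: an in-progress write finishes on the remaining nodes
    only if it completes before its container is closed."""
    return write_end_s < failure_time_s + CONTAINER_CLOSE_COOL_DOWN_S


for t, action in scm_actions(0):
    print(f"t={t:>4}s  {action}")
```

Ordering the container close before the pipeline close is what lets the remaining two nodes drain in-progress writes before the pipeline itself goes away.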
Impact:
* Graceful shutdown to finish pending apply transactions is not handled, with the
following consequences:
** If the leader closes, it returns a failure to the client waiting for a reply,
and the client can retry
** If one follower closes, a majority of nodes is still present to process
requests, and the container closes before the pipeline closes
** If 2 followers fail, only one node has the data, as expected
The following issues are to be handled in a separate JIRA:
# 2-node failure case
# client configuration for a long wait for commit-all / majority-commit, and other configuration