[
https://issues.apache.org/jira/browse/HDDS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated HDDS-15443:
------------------------------
Description:
(Revised by [~szetszwo])
When writeStateMachineData fails on a datanode (e.g. disk-out-of-space) for a
log entry of a client request, applyTransaction can never happen for that
request. Also, the datanode cannot append log entries anymore since a
writeStateMachineData failure is a RaftLog failure.
- (Bad case) If the datanode is a leader, it will never respond to the client
for that request. The client will keep waiting for that request and retrying
until either its RetryPolicy stops retrying or the RaftGroup is removed by SCM.
With the default configuration, SCM removes the RaftGroup in ~5min, while the
client would keep retrying for much longer than 5min. As a result, the client
hangs for about 5min, during which it cannot write any other requests.
- (Better case) If the datanode is a follower, it will stop working since it
cannot append log entries anymore. The client is able to receive a reply
from the leader for that request. Then, it will watch for ALL_COMMIT for the
log entry of that request. Since a follower has failed, the watch ALL_COMMIT
can never receive a reply until the watch timeout (default 3min). In this case,
the client can continue writing other requests while it is waiting for the
watch.
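The two cases above can be compared with a small model. This is only a sketch
of the timing described in this issue, not Ozone or Ratis code: the names
Timeouts and client_impact are hypothetical, and the client retry budget of
15min is an assumed value, since the description only says "much longer than
5min".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    """Timing values in seconds; only the first two come from the description."""
    scm_group_removal_s: float = 5 * 60    # "~5min" until SCM removes the RaftGroup
    watch_timeout_s: float = 3 * 60        # "default 3min" watch timeout
    client_retry_budget_s: float = 15 * 60  # ASSUMPTION: "much longer than 5min"


def client_impact(failed_role: str, t: Timeouts = Timeouts()) -> dict:
    """Summarize what the client sees when writeStateMachineData fails."""
    if failed_role == "leader":
        # No reply ever arrives; the client stops waiting when either its
        # retry policy gives up or SCM removes the RaftGroup, whichever is first.
        wait = min(t.client_retry_budget_s, t.scm_group_removal_s)
        return {"request_replied": False, "blocks_other_writes": True, "wait_s": wait}
    if failed_role == "follower":
        # The leader replies; only the ALL_COMMIT watch waits, in the background.
        return {"request_replied": True, "blocks_other_writes": False,
                "wait_s": t.watch_timeout_s}
    raise ValueError(f"unknown role: {failed_role!r}")


if __name__ == "__main__":
    for role in ("leader", "follower"):
        print(role, client_impact(role))
```

With these values the leader case blocks the client for the SCM removal time,
while the follower case only delays the watch and leaves other writes free to
proceed.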
was:
When the leader performs write() and it fails, the Ratis server does not respond
immediately, as it waits for re-election so that another server can handle this
request in the quorum. But since the leader is still present, re-election does
not happen, or succeeds only by chance.
Since no reply is returned by the server, the client hangs until a timeout
occurs or the pipeline is closed by SCM on this error.
The state machine is not usable, since no other request is allowed to be
processed. So it is better to close it, giving the behavior below:
If the leader write() fails and the state machine closes,
* the leader replies with ServerNotReadyException immediately
* the client retries as per its policy, until there is either a new leader or
the raft group is removed
* leader election happens within a few seconds if the leader is closed
* once a new leader is chosen and the client retries, the request returns
success with a majority commit
If one follower write() fails and its state machine closes, the leader will
still process client requests, committing with a majority of nodes.
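The fail-fast behavior described above can be illustrated with a toy model.
This is a sketch under stated assumptions, not the Ozone or Ratis
implementation: ToyStateMachine, ServerNotReadyError and write_with_retry are
hypothetical names, and rotating through a list of peers merely stands in for
the client reaching a newly elected leader.

```python
class ServerNotReadyError(Exception):
    """Stand-in for the ServerNotReadyException named in the description."""


class ToyStateMachine:
    """Hypothetical state machine that closes itself on the first write failure."""

    def __init__(self, free_bytes: int):
        self.free_bytes = free_bytes
        self.closed = False
        self.written = []  # log indices whose data was written successfully

    def write_state_machine_data(self, index: int, data: bytes) -> None:
        if self.closed:
            # Already closed: reject immediately instead of leaving the caller waiting.
            raise ServerNotReadyError(f"state machine closed, rejecting index {index}")
        if len(data) > self.free_bytes:
            # Simulated disk-out-of-space: close so later requests fail fast.
            self.closed = True
            raise ServerNotReadyError(f"write failed at index {index}: out of space")
        self.free_bytes -= len(data)
        self.written.append(index)


def write_with_retry(peers, index: int, data: bytes, max_attempts: int = 3) -> int:
    """Retry against the peers in turn; return the attempt number that succeeded."""
    for attempt in range(max_attempts):
        peer = peers[attempt % len(peers)]
        try:
            peer.write_state_machine_data(index, data)
            return attempt
        except ServerNotReadyError:
            continue  # immediate failure lets the client move on right away
    raise ServerNotReadyError("retry budget exhausted")


if __name__ == "__main__":
    failed_leader = ToyStateMachine(free_bytes=0)
    new_leader = ToyStateMachine(free_bytes=1024)
    print(write_with_retry([failed_leader, new_leader], 1, b"abc"))
```

The point of the sketch is that an immediate error turns the hang into an
ordinary retry, which succeeds as soon as the client reaches a healthy peer.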
On failure of any node, SCM
* will close containers with a cool-down time (2.5 minutes by default)
* will stop allocating any new blocks
* will close the pipeline after 5 min
This ensures in-progress writes can finish with 2 nodes running, if any.
Impact:
* Graceful shutdown to finish apply transaction is not handled; impact:
** If the leader closes, it returns a failure to the client waiting for the
reply, and the client can retry
** If one follower closes, a majority of nodes is still present to process
requests, and the container closes before the pipeline closes
** 2-node follower failure - in this case only one node has the data, as
expected.
The issues below are to be handled in a separate JIRA:
# 2-node failure case
# client configuration for long wait for commit-all / majority-commit and
other config
Summary: Close statemachine immediately on writeStateMachineData
failure (was: close statemachine immediately on write failure)
[~sumitagrawl], let's focus only on the problem this JIRA is trying to address.
I have revised the Summary and the Description.
> Close statemachine immediately on writeStateMachineData failure
> ---------------------------------------------------------------
>
> Key: HDDS-15443
> URL: https://issues.apache.org/jira/browse/HDDS-15443
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Datanode
> Reporter: Sumit Agrawal
> Assignee: Sumit Agrawal
> Priority: Major
> Labels: pull-request-available