[ 
https://issues.apache.org/jira/browse/HDDS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-15443:
----------------------------------
    Labels: pull-request-available  (was: )

> close statemachine immediately on write failure
> -----------------------------------------------
>
>                 Key: HDDS-15443
>                 URL: https://issues.apache.org/jira/browse/HDDS-15443
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>
> When the leader performs write() and it fails, the Ratis server does not 
> respond immediately: it waits for a re-election so that another server can 
> handle the request in the quorum. But since the leader is still present, 
> re-election either does not happen or succeeds only by chance.
>  
> Because the server returns no reply, the client hangs until a timeout 
> occurs or SCM closes the pipeline because of this error.
>  
> At this point the state machine is not usable, because no other request is 
> allowed to be processed. It is therefore better to close it, which gives the 
> following behavior.
> If the leader's write() fails and the state machine closes:
>  * the leader replies with ServerNotReadyException immediately
>  * the client retries as per its retry policy, until there is a new leader 
> or the raft group is removed
>  * leader election happens within a few seconds once the leader is closed
>  * once a new leader is chosen and the client retries, the request returns 
> success with a majority commit
>  
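The leader-failure behavior listed above can be sketched as a small simulation. This is only an illustration under stated assumptions: `Peer`, `NotReadyException`, and `writeWithRetry` are hypothetical names, `NotReadyException` is a simplified stand-in for Ratis' ServerNotReadyException rather than the real class, and rotating through peers is a crude substitute for the client's actual retry policy and leader discovery.

```java
import java.util.List;

class RetrySketch {
    // Hypothetical stand-in for ServerNotReadyException (unchecked here for brevity).
    static class NotReadyException extends RuntimeException {
        NotReadyException(String message) {
            super(message);
        }
    }

    // Hypothetical model of one raft peer whose state machine may be closed.
    static class Peer {
        final String name;
        final boolean stateMachineClosed;

        Peer(String name, boolean stateMachineClosed) {
            this.name = name;
            this.stateMachineClosed = stateMachineClosed;
        }

        // Fails fast instead of leaving the caller waiting for a timeout.
        String write(String data) {
            if (stateMachineClosed) {
                throw new NotReadyException(name + " is not ready");
            }
            return "committed by " + name;
        }
    }

    // Tries peers in turn, up to maxAttempts, and returns the first success.
    static String writeWithRetry(List<Peer> peers, String data, int maxAttempts) {
        NotReadyException last = new NotReadyException("no attempts made");
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Peer target = peers.get(attempt % peers.size());
            try {
                return target.write(data);
            } catch (NotReadyException e) {
                last = e; // immediate failure: move on to the next candidate
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        List<Peer> peers = List.of(
                new Peer("dn1", true),   // old leader, state machine closed
                new Peer("dn2", false),  // becomes the new leader
                new Peer("dn3", false));
        System.out.println(writeWithRetry(peers, "chunk", 3)); // committed by dn2
    }
}
```

The point of the sketch is that an immediate exception lets the retry loop make progress, whereas a missing reply would block the client until its timeout.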
> If one follower's write() fails and its state machine closes, the leader 
> still processes client requests and commits them with success on a majority 
> of nodes.
>  
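The majority argument for the follower case can be checked with simple arithmetic. The sketch below assumes the usual Raft majority rule and a 3-node pipeline; `majority` and `canCommit` are hypothetical helper names, not Ozone or Ratis APIs.

```java
class QuorumSketch {
    // Smallest number of peers that forms a majority of a raft group.
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    // A write can commit only if enough peers still have a usable state machine.
    static boolean canCommit(int groupSize, int closedPeers) {
        return groupSize - closedPeers >= majority(groupSize);
    }

    public static void main(String[] args) {
        System.out.println(canCommit(3, 1)); // one follower closed: true
        System.out.println(canCommit(3, 2)); // two followers closed: false
    }
}
```

With three nodes, one closed follower still leaves the two peers needed for a majority, while two closed followers do not, which is why the 2-node failure case is deferred to a separate issue.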
> On failure of any node, SCM:
>  * closes containers after a cool-down time (2.5 minutes by default)
>  * stops allocating new blocks
>  * closes the pipeline after 5 minutes
> This ensures that any in-progress writes can finish with 2 nodes running.
>  
> Impact:
>  * Graceful shutdown to finish applying transactions is not handled. Impact:
>  ** If the leader closes, it returns a failure to the client waiting for a 
> reply, and the client can retry
>  ** If one follower closes, a majority of nodes is still present to process 
> requests, and the container closes before the pipeline closes
>  ** 2-node follower failure: in this case only one node has the data, as 
> expected.
>  
> The issues below are to be handled in a separate JIRA:
>  # 2-node failure case
>  # client configuration for long waits on commit-all / majority-commit, and 
> other config
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

