[jira] [Updated] (HDDS-9826) Fix exception handling if one Datanode is not available (Ratis)

Ivan Brusentsev (Jira) Mon, 04 Dec 2023 05:12:54 -0800


     [ https://issues.apache.org/jira/browse/HDDS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ivan Brusentsev updated HDDS-9826:
----------------------------------
    Description: 
When a key is being uploaded through XceiverClientRatis and a datanode becomes 
unavailable, the client is expected to request a new pipeline and retry the 
upload.

In practice, before doing that, the client first repeats the commit check with 
the _MAJORITY_COMMITTED_ replication level, which cannot succeed because the 
pipeline is already closed by then.

XceiverClientRatis has a method watchForCommit(long index), which contains the 
following exception check:

 
{code:java}
if (t instanceof GroupMismatchException) {
  throw e;
}
{code}
GroupMismatchException is thrown by the Ratis client exactly when a datanode is 
not available and the key upload cannot continue on the current pipeline.

However, this check never matches, because
{code:java}
Throwable t = HddsClientUtils.checkForException(e);
{code}
does not unwrap the exception completely.

The proposed fix is to correct the lookup of nested exceptions so that the 
proper one is found. This improves failover latency by approximately 15 seconds.
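
The kind of lookup described above can be illustrated with a minimal, self-contained sketch. It is not the actual patch and does not reproduce HddsClientUtils.checkForException, whose implementation is not shown in this issue. The names NestedExceptionLookup, findCause, MAX_DEPTH and GroupMismatchStandIn are hypothetical; the stand-in class only takes the place of the Ratis GroupMismatchException so that the example compiles without Ratis on the classpath.

```java
import java.io.IOException;
import java.util.Optional;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

final class NestedExceptionLookup {

  // Hypothetical stand-in for the Ratis GroupMismatchException.
  static class GroupMismatchStandIn extends IOException {
    GroupMismatchStandIn(String message) {
      super(message);
    }
  }

  // Assumed bound on the chain length, guarding against cyclic cause chains.
  private static final int MAX_DEPTH = 32;

  // Walks the whole cause chain and returns the first throwable of the
  // requested type, or an empty Optional if no such throwable is found.
  static <T extends Throwable> Optional<T> findCause(Throwable error, Class<T> type) {
    Throwable t = error;
    for (int depth = 0; t != null && depth < MAX_DEPTH; depth++) {
      if (type.isInstance(t)) {
        return Optional.of(type.cast(t));
      }
      t = t.getCause();
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // The interesting exception is buried under several wrappers.
    Throwable wrapped = new ExecutionException(
        new CompletionException(
            new IOException(new GroupMismatchStandIn("group not found"))));

    // Checking only the outermost exception misses it.
    System.out.println(wrapped instanceof GroupMismatchStandIn); // prints false

    // Walking the full chain finds it.
    System.out.println(
        findCause(wrapped, GroupMismatchStandIn.class).isPresent()); // prints true
  }
}
```

With a lookup of this shape, a check such as the one in watchForCommit could match the nested exception directly instead of depending on how many wrappers were removed beforehand.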


> Fix exception handling if one Datanode is not available (Ratis)
> ---------------------------------------------------------------
>
>                 Key: HDDS-9826
>                 URL: https://issues.apache.org/jira/browse/HDDS-9826
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: SCM Client
>    Affects Versions: 1.3.0
>            Reporter: Ivan Brusentsev
>            Assignee: Ivan Brusentsev
>            Priority: Minor



