[ https://issues.apache.org/jira/browse/HDDS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Brusentsev updated HDDS-9826:
----------------------------------
Description:
When a key is being uploaded through XceiverClientRatis and a datanode becomes
unavailable, the client is expected to request a new pipeline and retry the
upload.
In practice, before doing so the client first repeats the commit check with the
_MAJORITY_COMMITTED_ replication level, which cannot succeed because the
pipeline is already closed at that moment.
XceiverClientRatis has a method watchForCommit(long index), which contains the
following exception check:
{code:java}
if (t instanceof GroupMismatchException) {
  throw e;
}
{code}
GroupMismatchException is thrown by the Ratis client exactly when a datanode is
unavailable and the key upload cannot continue on the current pipeline.
However, this check does not work, because
{code:java}
Throwable t = HddsClientUtils.checkForException(e);{code}
does not unwrap the exception completely.
The idea is to fix the lookup of nested exceptions so that the proper one is
found. This improves failover latency by approximately 15 seconds.
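As a rough illustration of the nested-exception lookup described above, the
sketch below walks the whole cause chain instead of inspecting a single level.
It is not the actual patch: the local GroupMismatchException class is a
stand-in for the Ratis one so that the example compiles on its own, and the
class name NestedCauseLookup, the method findCause, the depth limit, and the
particular wrapping (ExecutionException around CompletionException) are
assumptions made for this example only.

```java
// Hypothetical stand-in for the Ratis GroupMismatchException, declared locally
// so that this sketch compiles without Ratis on the classpath.
class GroupMismatchException extends java.io.IOException {
  public GroupMismatchException(String message) {
    super(message);
  }
}

// Hypothetical helper; the name and shape are assumptions, not Ozone code.
class NestedCauseLookup {
  // Bound the walk so that a cyclic or very deep cause chain cannot loop forever.
  private static final int MAX_DEPTH = 32;

  /**
   * Returns the first throwable of the given type found in the cause chain,
   * starting with the error itself, or null if there is none.
   */
  public static <T extends Throwable> T findCause(Throwable error, Class<T> type) {
    Throwable current = error;
    for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
      if (type.isInstance(current)) {
        return type.cast(current);
      }
      current = current.getCause();
    }
    return null;
  }

  public static void main(String[] args) {
    // Assumed wrapping, for illustration only: the interesting exception sits
    // two levels below the one that the caller catches.
    Exception wrapped = new java.util.concurrent.ExecutionException(
        new java.util.concurrent.CompletionException(
            new GroupMismatchException("group not found")));

    // Looking at a single level misses the nested exception.
    System.out.println(wrapped.getCause() instanceof GroupMismatchException); // false

    // Walking the full chain finds it.
    System.out.println(
        findCause(wrapped, GroupMismatchException.class) != null); // true
  }
}
```

With a lookup of this kind, a check such as the one in watchForCommit could
match the nested GroupMismatchException no matter how many wrappers surround it.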
> Fix exception handling if one Datanode is not available (Ratis)
> ---------------------------------------------------------------
>
> Key: HDDS-9826
> URL: https://issues.apache.org/jira/browse/HDDS-9826
> Project: Apache Ozone
> Issue Type: Task
> Components: SCM Client
> Affects Versions: 1.3.0
> Reporter: Ivan Brusentsev
> Assignee: Ivan Brusentsev
> Priority: Minor