[ 
https://issues.apache.org/jira/browse/HDDS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759973#comment-17759973
 ] 

Stephen O'Donnell commented on HDDS-9189:
-----------------------------------------

On the read path, for both Ratis and EC, all replicas available in SCM are 
returned in a read pipeline: CLOSED replicas, replicas transitioning state, and 
UNHEALTHY replicas.

After that, what happens on the client depends on whether Ratis or EC is being 
read.

Ratis:

For Ratis, any replica is as good as any other to read from. In many cases an 
unhealthy container will have a problem with only a single block or a small 
number of blocks. The client will attempt to read from any of the replicas 
using its usual algorithm, such as network distance. If it encounters an error 
reading (e.g. block checksum mismatch or timeout) it will simply try another 
replica until it runs out of replicas. At that point it will throw an error.

This means that if all 3 replicas are unhealthy, and the corruption in each 
replica is small and isolated to different blocks, the container will still be 
fully readable.
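
The Ratis failover described above can be sketched roughly as follows. This is 
an illustrative Python model only, not the Ozone client code: `ReadError`, 
`read_block_ratis`, `fake_read` and the datanode names are all hypothetical, 
and the replica list is assumed to be already sorted by the client.

```python
class ReadError(Exception):
    """Hypothetical error for a checksum mismatch or timeout on one replica."""


def read_block_ratis(block_id, replicas, read_from):
    """Try each replica in the client's preferred order; return the first success.

    replicas:  replica names, assumed already sorted (e.g. by network distance).
    read_from: hypothetical callable (replica, block_id) -> bytes that raises
               ReadError when the read fails.
    """
    last_error = None
    for replica in replicas:
        try:
            return read_from(replica, block_id)
        except ReadError as err:
            last_error = err  # remember the failure and move to the next replica
    raise ReadError(f"block {block_id} unreadable on all replicas") from last_error


# Three unhealthy replicas, each corrupt in a different block.
corrupt = {"dn1": {1}, "dn2": {2}, "dn3": {3}}


def fake_read(replica, block_id):
    if block_id in corrupt[replica]:
        raise ReadError(f"checksum mismatch on {replica}")
    return f"block-{block_id}".encode()


# Every block is still readable, because no block is corrupt on all replicas.
results = [read_block_ratis(b, ["dn1", "dn2", "dn3"], fake_read) for b in (1, 2, 3, 4)]
```

Only a block that is bad on every replica makes `read_block_ratis` raise.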

EC:

EC will work in a similar fashion, but it is important to keep in mind that EC 
has two read paths. Normal reads are performed when all data replicas are 
available. Reconstruction reads are performed when one or more data blocks are 
unavailable.

For EC, when we have the following (H = healthy, U = Unhealthy):

{code}
Replica Index: 1 2 3 4 5
Replica        H H H U U
{code}

The read should be performed via the Normal path without any issues, as the 
parity replicas are not needed for a normal read.

With:

{code}
Replica Index: 1 2 3 4 5
Replica        H H U H H
{code}

If the block being read encounters an error from the Unhealthy container, the 
client will seamlessly fail over to the reconstruction read. If the block is 
read successfully from the Unhealthy container, then the normal read will occur.

With:

{code}
Replica Index: 1 2 3  4 5
Replica        H H UH H H
{code}

In this case, we have two copies of index 3. One unhealthy and one healthy. 
This means there is a spare copy for this index. If the read first tries to use 
the unhealthy replica and fails, it will try to read the data from the spare, 
before falling back to reconstruction reads.
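
The three EC cases above can be summarised in a small sketch. It is a 
simplified Python model, not the Ozone client: `plan_ec_read` and `block_ok` 
are hypothetical names, the layout is assumed to be 3 data + 2 parity as in 
the diagrams, and it assumes any 3 readable indexes are enough to reconstruct. 
The real client discovers failures lazily while reading, whereas this model is 
told up front which replicas read cleanly.

```python
def plan_ec_read(replicas, block_ok, data=3):
    """Pick a read path for one block: ("normal" | "reconstruction", sources).

    replicas: dict of replica index (1-based) -> list of replica names, in the
              order the client would try them (spares share an index).
    block_ok: hypothetical callable (name) -> bool, True if the block reads
              cleanly from that replica.
    """
    usable = {}
    for index, names in replicas.items():
        for name in names:  # spare copies of an index are tried in turn
            if block_ok(name):
                usable[index] = name
                break

    data_indexes = range(1, data + 1)
    if all(i in usable for i in data_indexes):
        # Parity is not needed when every data index can be read.
        return "normal", {i: usable[i] for i in data_indexes}
    if len(usable) >= data:
        picked = sorted(usable)[:data]
        return "reconstruction", {i: usable[i] for i in picked}
    raise IOError("not enough readable replicas to reconstruct the block")


one_each = {1: ["a"], 2: ["b"], 3: ["c"], 4: ["d"], 5: ["e"]}
with_spare = {1: ["a"], 2: ["b"], 3: ["c_bad", "c_spare"], 4: ["d"], 5: ["e"]}

case_huu = plan_ec_read(one_each, lambda n: n not in {"d", "e"})   # H H H U U
case_u3 = plan_ec_read(one_each, lambda n: n != "c")               # H H U H H
case_spare = plan_ec_read(with_spare, lambda n: n != "c_bad")      # H H UH H H
```

Here `case_huu` and `case_spare` stay on the normal path (the latter reading 
index 3 from the spare), while `case_u3` falls back to reconstruction using 
indexes 1, 2 and 4.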

If we have something like:

{code}
Replica Index: 1 2 3 4 5
Replica        U U U - -
{code}

I.e. under-replicated with all replicas unhealthy, then it is unlikely that all 
of the data can be read. This is because, with no parity replicas present, 
every remaining replica is needed to read each block. Depending on the problems 
in the containers, a large part of the container is still likely to be readable.
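
As a rough illustration of why most of the container may still be readable in 
this case, the sketch below counts the blocks that survive when only the three 
data indexes exist and each has a few bad blocks. The function name, the block 
numbering and the sample corruption are made up for the example, and it 
assumes a block needs 3 readable indexes.

```python
def readable_blocks(total_blocks, bad_blocks_by_index, data=3):
    """Return the blocks for which at least `data` replica indexes read cleanly.

    bad_blocks_by_index: dict of replica index -> set of block numbers that
                         fail to read from that (unhealthy) replica.
    """
    readable = []
    for block in range(total_blocks):
        ok = sum(1 for bad in bad_blocks_by_index.values() if block not in bad)
        if ok >= data:
            readable.append(block)
    return readable


# Indexes 4 and 5 are missing; indexes 1-3 are unhealthy with isolated errors.
bad = {1: {4}, 2: {17}, 3: {4, 60}}
readable = readable_blocks(100, bad)
lost = sorted(set(range(100)) - set(readable))  # [4, 17, 60]
```

With no spare index to fall back on, any block that is bad on a single replica 
is lost, but in this example 97 of the 100 blocks remain readable.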

In summary, even with unhealthy containers, both the Ratis and EC read paths 
can handle errors gracefully and will allow as much of the data as possible to 
be read. If EC has over-replicated copies with a mix of healthy and unhealthy, 
the spare replica can be used to recover from an error before going to 
reconstruction reads.

As an unhealthy container most likely has errors in only a few blocks, and the 
Replication Manager should replace the unhealthy replicas quickly, it doesn't 
feel like there is anything we need to do in this Jira.

> EC: Investigage how EC read path deals with unhealthy replicas
> --------------------------------------------------------------
>
>                 Key: HDDS-9189
>                 URL: https://issues.apache.org/jira/browse/HDDS-9189
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> When SCM provides a read pipeline to a client, it simply returns a pipeline 
> containing all replicas it knows about. The state of a replica is not 
> included in the pipeline, so the client cannot distinguish between healthy 
> and unhealthy replicas. However the pipeline nodes are sorted, so that if 
> there are multiple replicas for a given replica index, and some of those 
> nodes are not IN_SERVICE (eg decommissioning etc), then the IN_SERVICE nodes 
> are always sorted first.
> On SCM we should be able to do something similar to put UNHEALTHY replicas to 
> the back of the list, so the client will try to read healthy ones first.
> It is also worth checking what the client does when it encounters an 
> unhealthy replica - will it fall back to reconstruction read, or if there is 
> a spare index will it use it first?


