[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

Todd Lipcon (JIRA) Thu, 30 Aug 2012 16:15:10 -0700

    [ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445411#comment-13445411 ]


Todd Lipcon commented on HDFS-3875:
-----------------------------------

Just to brainstorm, here's one potential solution:
- if the tail node in the pipeline detects a checksum error, then it returns a 
special error code back up the pipeline indicating this (rather than just 
disconnecting)
- if a non-tail node receives this error code, then it immediately scans its 
own on-disk copy of the block (from the beginning up through the last acked 
length). If it finds corruption in its local copy, then it should assume that 
_it_ is the faulty one, rather than its downstream neighbor. If it finds no 
corruption, then the fault lies with either the downstream mirror or the 
network link between the two, and the current behavior is reasonable.

Depending on the outcome of that scan, the node would report the appropriate 
errorIndex back to the client, so that the correct faulty node is removed from 
the pipeline.
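
As a rough illustration of the decision a non-tail node would make under this 
proposal, here is a minimal Java sketch. Everything in it is an assumption for 
illustration only: the class and method names are hypothetical and are not 
taken from the DataNode code, the 512-byte chunk size and the use of CRC32 are 
placeholders, the replica is modeled as an in-memory byte array, and the stored 
checksums are assumed to cover exactly the acked prefix of the block.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical sketch of the proposed check; not actual HDFS code. */
class PipelineChecksumSketch {
    // Assumed number of data bytes covered by each checksum.
    static final int BYTES_PER_CHECKSUM = 512;

    /** Computes one CRC32 per chunk over data[0..length). The last chunk may be partial. */
    static long[] computeChecksums(byte[] data, int length) {
        int chunks = (length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /**
     * Rescans the local replica from the beginning up through the last acked
     * length and compares the result against the stored checksums.
     */
    static boolean localCopyIsClean(byte[] blockData, long[] storedSums, int ackedLength) {
        return Arrays.equals(computeChecksums(blockData, ackedLength), storedSums);
    }

    /**
     * Called when this node (at position myIndex in the pipeline) receives the
     * special "checksum error" code from downstream. Returns the pipeline
     * index to report to the client as faulty.
     */
    static int chooseErrorIndex(int myIndex, byte[] blockData,
                                long[] storedSums, int ackedLength) {
        if (!localCopyIsClean(blockData, storedSums, ackedLength)) {
            // Local copy is corrupt: assume this node is the faulty one.
            return myIndex;
        }
        // Local copy is fine: blame the downstream mirror (or the link to
        // it), which matches the current behavior.
        return myIndex + 1;
    }
}
```

With a clean local copy, chooseErrorIndex(0, ...) returns 1 and the downstream 
node is dropped as today; if the scan finds corruption it returns 0, so the 
first node is removed instead of being left as the only (corrupt) replica.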
                
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 2.2.0-alpha
>            Reporter: Todd Lipcon
>
> We saw this issue with one block in a large test cluster. The client is 
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it 
> received from the first node. We don't know if the client sent a bad 
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it 
> threw an exception. The pipeline started up again with only one replica (the 
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and 
> unrecoverable since it is the only replica

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
