[
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595134#comment-13595134
]
Kihwal Lee commented on HDFS-3875:
----------------------------------
The new patch forces datanodes to truncate the block being recovered to the
acked length. Since the nodes in the middle of the write pipeline do not
perform checksum verification and write data to disk before getting an ack
back from downstream, the unacked portion of the block can contain corrupt
data. If the last node simply disappears before reporting a checksum error
upstream, the current pipeline recovery mechanism can overlook the corruption
in the written data. Since the truncation discards the potentially corrupt
portion of the block, no explicit checksum re-verification is needed on a
checksum error.
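The effect of the truncation can be pictured with a small sketch. This is not
the actual DataNode recovery code; the class name, helper name, and the file
layout constants below are assumptions. It only illustrates how both the block
file and its checksum (meta) file would be cut back to the acked length so the
unacked, possibly corrupt tail is discarded and the last partial chunk's
checksum is recomputed:
{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

public class ReplicaTruncator {
  private static final int BYTES_PER_CHECKSUM = 512; // assumed chunk size
  private static final int CHECKSUM_SIZE = 4;        // assumed CRC32 width
  private static final int META_HEADER_SIZE = 7;     // assumed meta header size

  /** Truncate the block and meta files so only acked bytes remain on disk. */
  public static void truncateToAckedLength(File blockFile, File metaFile,
                                           long ackedLength) throws IOException {
    try (RandomAccessFile block = new RandomAccessFile(blockFile, "rw");
         RandomAccessFile meta = new RandomAccessFile(metaFile, "rw")) {
      if (block.length() <= ackedLength) {
        return; // nothing past the acked length to discard
      }
      // Drop the unacked (possibly corrupt) tail of the block data.
      block.setLength(ackedLength);

      // The last chunk may now be partial, so its stored checksum no longer
      // matches; recompute it from the bytes that remain.
      long lastChunkStart = (ackedLength / BYTES_PER_CHECKSUM) * BYTES_PER_CHECKSUM;
      int lastChunkLen = (int) (ackedLength - lastChunkStart);
      if (lastChunkLen > 0) {
        byte[] buf = new byte[lastChunkLen];
        block.seek(lastChunkStart);
        block.readFully(buf);
        CRC32 crc = new CRC32();
        crc.update(buf, 0, lastChunkLen);
        meta.seek(META_HEADER_SIZE
            + (lastChunkStart / BYTES_PER_CHECKSUM) * CHECKSUM_SIZE);
        meta.writeInt((int) crc.getValue());
      }

      // Keep checksum entries only for chunks that still exist.
      long chunks = (ackedLength + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
      meta.setLength(META_HEADER_SIZE + chunks * CHECKSUM_SIZE);
    }
  }
}
{code}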
Another new feature in the latest patch is to terminate the HDFS client when
pipeline recovery is attempted more than 5 times while writing the same data
packet, since this likely indicates that the source data is corrupt. In a very
small cluster, clients may run out of datanodes and fail before retrying 5
times.
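The second change amounts to a per-packet recovery counter on the client side.
Again, this is only an illustration rather than the actual DFSOutputStream
code; the class, method, and threshold names are assumptions:
{code:java}
import java.io.IOException;

class PipelineRecoveryLimiter {
  private static final int MAX_RECOVERIES_PER_PACKET = 5; // assumed limit
  private long lastPacketSeqno = -1;
  private int recoveryCount = 0;

  /** Called each time pipeline recovery starts while resending a packet. */
  void onRecoveryAttempt(long packetSeqno) throws IOException {
    if (packetSeqno == lastPacketSeqno) {
      recoveryCount++;
    } else {
      lastPacketSeqno = packetSeqno;
      recoveryCount = 1;
    }
    if (recoveryCount > MAX_RECOVERIES_PER_PACKET) {
      // Repeated failures on the same packet suggest the data (or its
      // checksum) is bad at the source, so give up instead of looping.
      throw new IOException("Pipeline recovery attempted " + recoveryCount
          + " times for packet " + packetSeqno + "; aborting write");
    }
  }
}
{code}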
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
> Key: HDFS-3875
> URL: https://issues.apache.org/jira/browse/HDFS-3875
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.0.2-alpha
> Reporter: Todd Lipcon
> Assignee: Kihwal Lee
> Priority: Critical
> Attachments: hdfs-3875.branch-0.23.no.test.patch.txt,
> hdfs-3875.branch-0.23.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt,
> hdfs-3875.patch.txt, hdfs-3875.trunk.no.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt,
> hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it
> received from the first node. We don't know if the client sent a bad
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it
> threw an exception. The pipeline started up again with only one replica (the
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and
> unrecoverable since it is the only replica