[
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508337#comment-13508337
]
Suresh Srinivas commented on HDFS-3875:
---------------------------------------
It took me a lot of time to review this code. The BlockReceiver code is poorly
documented. One of these days I will add some javadoc to make the code easier
to understand and review :-).
Why do you have two variants of the patch - with and without tests?
Comments for patch with no tests:
# The comment for #checksumError could read: "Indicates a checksum error. When
set, block receiving and writing is stopped." It is also better to initialize
it to false at the declaration than in the constructor.
# #shouldVerifyChecksum() - could we describe in the javadoc the conditions
under which the checksum needs to be verified? Along the lines of: "Checksum is
verified in the following cases: 1. the datanode is the last one in the
pipeline, with no mirrorOut; 2. the block is being written by another datanode
for replication; 3. checksum translation is needed." There is an equivalent
comment where shouldVerifyChecksum() is presently called; that comment can be
removed.
# receivePacket() previously returned either -1, when a block was completely
written, or the length of the packet received. Now it also returns -1 on a
checksum error. It would be good to add javadoc to this method documenting
when -1 is returned.
# receivePacket() - do you think it is a good idea to print warn/info level
logs when returning -1 on a checksum error, or when checksumError is set? This
would help debug these issues on each datanode in the pipeline using the logs.
Given that these errors are rare, they should not take up much log space.
# Comment "If there is a checksum error, responder will shut it down": can you
please clarify this comment to say "responder will shut itself down and
interrupt the receiver".
# In #enqueue() - why is the checksumError check inside the synchronized
block? It can be outside, right?
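The field-level comment and initialization suggested above could look like the
following sketch. The class and accessor names here are illustrative, not the
actual BlockReceiver source.

```java
// Illustrative sketch only; this is not the actual HDFS BlockReceiver class.
class BlockReceiverSketch {
    /**
     * Indicates a checksum error. When set, block receiving and writing is
     * stopped. Initialized at the declaration rather than in the constructor,
     * as suggested in the review.
     */
    private volatile boolean checksumError = false;

    // Hypothetical accessor for illustration.
    boolean isChecksumError() {
        return checksumError;
    }
}
```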
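The suggested javadoc for shouldVerifyChecksum() might read as below. The
pipeline state is modeled with plain booleans for illustration, since the real
BlockReceiver fields are not shown here.

```java
// Hypothetical model of the three conditions named in the review; field
// names are assumptions, not the actual BlockReceiver members.
class ChecksumPolicySketch {
    boolean mirrorOutPresent;          // a downstream mirror exists
    boolean writtenByDatanode;         // block written by another DN (replication)
    boolean needsChecksumTranslation;  // stored checksum type differs

    /**
     * Checksum is verified in the following cases:
     * 1. the datanode is the last one in the pipeline (no mirrorOut),
     * 2. the block is being written by another datanode for replication,
     * 3. checksum translation is needed.
     */
    boolean shouldVerifyChecksum() {
        return !mirrorOutPresent || writtenByDatanode || needsChecksumTranslation;
    }
}
```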
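The -1 return documentation and the warn-level log suggested for
receivePacket() could be sketched as follows. The method signature, logger,
and log message are illustrative only.

```java
import java.util.logging.Logger;

// Sketch of the suggested javadoc and warn-level log; not the real method.
class ReceiveSketch {
    private static final Logger LOG = Logger.getLogger("ReceiveSketch");
    private volatile boolean checksumError = false;

    /**
     * @return the length of the packet received, or -1 if the block is
     *         completely written or a checksum error was detected.
     */
    int receivePacket(boolean packetCorrupt) {
        if (packetCorrupt) {
            checksumError = true;
            // Logging here makes the failure visible on each datanode
            // in the pipeline, as the review suggests.
            LOG.warning("Checksum error detected; stopping block receive");
            return -1;   // the new -1 case called out in the review
        }
        return 0;        // placeholder for the normal packet-length path
    }
}
```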
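The enqueue() suggestion, checking the checksumError flag before taking the
lock, might look like the sketch below. The queue type and method shape are
assumptions; the check is safe outside the synchronized block only if the
flag is volatile (or otherwise safely published).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the reviewer's suggestion: the flag read needs no lock,
// so it can happen before entering the synchronized block.
class ResponderSketch {
    private volatile boolean checksumError = false;  // volatile: safe unlocked read
    private final Deque<Long> ackQueue = new ArrayDeque<>();

    boolean enqueue(long seqno) {
        if (checksumError) {          // checked outside the lock, as suggested
            return false;
        }
        synchronized (ackQueue) {
            ackQueue.add(seqno);
            ackQueue.notifyAll();     // wake the responder thread
        }
        return true;
    }
}
```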
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
> Key: HDFS-3875
> URL: https://issues.apache.org/jira/browse/HDFS-3875
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.0.2-alpha
> Reporter: Todd Lipcon
> Assignee: Kihwal Lee
> Priority: Blocker
> Attachments: hdfs-3875.branch-0.23.no.test.patch.txt,
> hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it
> received from the first node. We don't know if the client sent a bad
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it
> threw an exception. The pipeline started up again with only one replica (the
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and
> unrecoverable since it is the only replica
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira