[
https://issues.apache.org/jira/browse/HDFS-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501420#comment-13501420
]
Kihwal Lee commented on HDFS-3874:
----------------------------------
I've seen this happening. This is worse than it looks. In the 3-replica /
2-min-replication case, the last datanode in the pipeline does not report
anything, and the pipeline is recreated with the remaining two nodes. The
problem is that those two nodes may have already written the corrupt data to
disk. The reconstructed pipeline will be used and the block will complete. Once
the block is complete, the NN will schedule replication, which will fail against
the two sources one by one, leaving the block "missing".
Looking at the code, the source DatanodeID used in corruption reporting is
propagated from the client. But when DFSClient calls writeBlock(), it passes
null as srcNode, so no node in the pipeline has a valid srcNode. Maybe the NN
should check whether the block is under construction and whether the reporter
was the last node in the pipeline. In that case, all copies of the block should
be marked as corrupt.
In addition, the last node in the pipeline should synchronously return an
appropriate failure, instead of simply going away.
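To make the proposed check concrete, here is a minimal sketch of the decision
the NN could make when it receives a bad-block report. This is illustrative
only: the class, method, and node names are hypothetical, not real HDFS APIs,
and the real logic would live in BlockManager against actual DatanodeID and
BlockInfoUnderConstruction objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: if the reported block is still under construction and
// the reporter was the last datanode in the write pipeline, treat every
// replica as corrupt instead of rejecting the report because srcNode is null
// (the "datanode :0 does not exist" case in the stack trace below).
public class CorruptReportSketch {

    enum BlockState { UNDER_CONSTRUCTION, COMPLETE }

    static List<String> replicasToMarkCorrupt(BlockState state,
                                              List<String> pipeline,
                                              String reporter,
                                              String srcNode) {
        // Normal path: the report names a valid source datanode.
        if (srcNode != null && !srcNode.isEmpty()) {
            return Arrays.asList(srcNode);
        }
        // Proposed path: no srcNode, but the block is under construction and
        // the reporter was the last node in the pipeline -- the corrupt data
        // propagated through the upstream nodes, so every copy is suspect.
        if (state == BlockState.UNDER_CONSTRUCTION
                && !pipeline.isEmpty()
                && pipeline.get(pipeline.size() - 1).equals(reporter)) {
            return pipeline;
        }
        // Otherwise the corruption cannot be attributed to any datanode.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> pipeline =
            Arrays.asList("dn1:50010", "dn2:50010", "dn3:50010");
        // Last pipeline node reports with a null srcNode: mark all replicas.
        System.out.println(replicasToMarkCorrupt(
            BlockState.UNDER_CONSTRUCTION, pipeline, "dn3:50010", null));
        // A middle node reporting with null srcNode stays unattributable.
        System.out.println(replicasToMarkCorrupt(
            BlockState.UNDER_CONSTRUCTION, pipeline, "dn2:50010", null));
    }
}
```

The point of the sketch is only the branch order: a valid srcNode keeps the
existing behavior, and the new branch fires only for the
under-construction/last-node case, so completed blocks are unaffected.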
> Exception when client reports bad checksum to NN
> ------------------------------------------------
>
> Key: HDFS-3874
> URL: https://issues.apache.org/jira/browse/HDFS-3874
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs client, name-node
> Affects Versions: 2.0.0-alpha
> Reporter: Todd Lipcon
>
> We see the following exception in our logs on a cluster:
> {code}
> 2012-08-27 16:34:30,400 INFO org.apache.hadoop.hdfs.StateChange: *DIR*
> NameNode.reportBadBlocks
> 2012-08-27 16:34:30,400 ERROR
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:hdfs (auth:SIMPLE) cause:java.io.IOException: Cannot mark
> blk_8285012733733669474_140475196{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.29.97.219:50010|RBW]]}(same as stored)
> as corrupt because datanode :0 does not exist
> 2012-08-27 16:34:30,400 INFO org.apache.hadoop.ipc.Server: IPC Server handler
> 46 on 8020, call
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.reportBadBlocks from
> 172.29.97.219:43805: error: java.io.IOException: Cannot mark
> blk_8285012733733669474_140475196{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.29.97.219:50010|RBW]]}(same as stored)
> as corrupt because datanode :0 does not exist
> java.io.IOException: Cannot mark
> blk_8285012733733669474_140475196{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.29.97.219:50010|RBW]]}(same as stored)
> as corrupt because datanode :0 does not exist
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.markBlockAsCorrupt(BlockManager.java:1001)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.findAndMarkBlockAsCorrupt(BlockManager.java:994)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.reportBadBlocks(FSNamesystem.java:4736)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.reportBadBlocks(NameNodeRpcServer.java:537)
> at
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.reportBadBlocks(DatanodeProtocolServerSideTranslatorPB.java:242)
> at
> org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:20032)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira