[ 
https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496446#comment-14496446
 ] 

Harsh J commented on HDFS-8113:
-------------------------------

Stale block copies left over on the DN can cause this condition. It
does indeed go away if you clear out the RBW directory on the DN.

Imagine this condition:
1. A file is being written. It has a replica on node X, among others.
2. The replica write to node X in the pipeline fails. The write carries
on, leaving a stale block copy in the RBW directory of node X.
3. The file gets closed and deleted soon or immediately after (well
before the next block report from X).
4. The next block report from X includes the RBW replica, but the NN no
longer has any knowledge of the block.
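
To make the sequence concrete, here is a toy, self-contained Java model of
it. It is only a sketch, not NameNode code: ToyFile, ToyBlockInfo,
processReportedRbw and simulate are hypothetical names, and it assumes the
NN still holds a record for the block whose owning file reference (bc) has
been cleared, which is the state in which the copy constructor quoted in
the issue description would dereference null.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleRbwReportSketch {
    /** Hypothetical stand-in for the file that owns a block. */
    static final class ToyFile {
        final short replication;
        ToyFile(short replication) { this.replication = replication; }
    }

    /** Hypothetical stand-in for the NN's per-block record. */
    static final class ToyBlockInfo {
        final long blockId;
        final short replication;
        ToyFile bc; // assumed to become null once the owning file is deleted

        ToyBlockInfo(long blockId, ToyFile bc) {
            this.blockId = blockId;
            this.replication = bc.replication;
            this.bc = bc;
        }

        // Same shape as the quoted copy constructor: it reads from.bc
        // without a null check.
        ToyBlockInfo(ToyBlockInfo from) {
            this.blockId = from.blockId;
            this.replication = from.bc.replication; // NPE if from.bc == null
            this.bc = from.bc;
        }
    }

    /** Models the NN handling one RBW replica listed in a block report. */
    static String processReportedRbw(Map<Long, ToyBlockInfo> blocksMap, long id) {
        ToyBlockInfo stored = blocksMap.get(id);
        if (stored == null) {
            return "unknown block";
        }
        try {
            new ToyBlockInfo(stored); // copy made while handling the replica
            return "copied";
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    static String simulate() {
        Map<Long, ToyBlockInfo> blocksMap = new HashMap<>();
        long id = 1001L;
        // Steps 1-2: the file is written; node X keeps a stale RBW copy.
        blocksMap.put(id, new ToyBlockInfo(id, new ToyFile((short) 2)));
        // Step 3: the file is deleted; the owner reference is cleared (assumption).
        blocksMap.get(id).bc = null;
        // Step 4: node X reports the stale RBW replica.
        return processReportedRbw(blocksMap, id);
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the report from node X ends in the NullPointerException path.
Whether the real NN passes through the same intermediate state is exactly
what the reproduction steps are meant to confirm.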

I think modifying Colin's test this way should reproduce the issue:

1. start a mini dfs cluster with 2 datanodes
2. create a file with repl=2, but do not close it (flush it to ensure
on-disk RBW write)
3. take down one DN
4. close and delete the file
5. wait
6. bring the stopped DN back up; it will still have the RBW block
from the deleted file




-- 
Harsh J


> NullPointerException in BlockInfoContiguous causes block report failure
> -----------------------------------------------------------------------
>
>                 Key: HDFS-8113
>                 URL: https://issues.apache.org/jira/browse/HDFS-8113
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>         Attachments: HDFS-8113.patch
>
>
> The following copy constructor can throw NullPointerException if {{bc}} is 
> null.
> {code}
>   protected BlockInfoContiguous(BlockInfoContiguous from) {
>     this(from, from.bc.getBlockReplication());
>     this.bc = from.bc;
>   }
> {code}
> We have observed that some DataNodes keep failing to send block reports to the 
> NameNode. The stack trace is as follows. Though we are not using the latest 
> version, the problem still exists.
> {quote}
> 2015-03-08 19:28:13,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.<init>(BlockInfo.java:80)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockToMarkCorrupt.<init>(BlockManager.java:1696)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.checkReplicaCorrupt(BlockManager.java:2185)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReportedBlock(BlockManager.java:2047)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiff(BlockManager.java:1950)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1823)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1750)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1069)
> at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
> at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1623)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> {quote}


