Wei-Chiu Chuang created HDFS-13758:
--------------------------------------
Summary: DatanodeManager should throw exception if it has
BlockRecoveryCommand but the block is not under construction
Key: HDFS-13758
URL: https://issues.apache.org/jira/browse/HDFS-13758
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.0.0-alpha1
Reporter: Wei-Chiu Chuang
In Hadoop 3, HDFS-8909 added an assertion that assumes that if a
BlockRecoveryCommand exists for a block, the block is under construction.
{code:title=DatanodeManager#getBlockRecoveryCommand()}
    BlockRecoveryCommand brCommand = new BlockRecoveryCommand(blocks.length);
    for (BlockInfo b : blocks) {
      BlockUnderConstructionFeature uc = b.getUnderConstructionFeature();
      assert uc != null;
      ...
{code}
This assertion accidentally fixed one of the possible data corruption scenarios
of HDFS-10240: a recoverLease() immediately followed by a close(), before
DataNodes have had a chance to heartbeat.
In a unit test you'll get:
{noformat}
2018-07-19 09:43:41,331 [IPC Server handler 9 on 57890] WARN ipc.Server (Server.java:logException(2724)) - IPC Server handler 9 on 57890, call Call#41 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 127.0.0.1:57903
java.lang.AssertionError
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getBlockRecoveryCommand(DatanodeManager.java:1551)
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleHeartbeat(DatanodeManager.java:1661)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleHeartbeat(FSNamesystem.java:3865)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1504)
	at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:119)
	at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31660)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
{noformat}
I propose to change this assertion, even though it addresses the data
corruption, because:
# We should throw a more meaningful exception than an NPE.
# On a production cluster the assert is ignored (assertions are disabled by
default in the JVM), and you'll get an NPE instead. Future HDFS developers
might "fix" that NPE, causing a regression. An NPE is typically not caught and
handled, so it can leave internal state inconsistent.
# It doesn't address all possible scenarios of HDFS-10240. A proper fix should
reject close() if the block is being recovered.
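One possible shape for the first point, sketched outside the HDFS codebase (the BlockInfo stand-in below is a simplified placeholder, not the real org.apache.hadoop.hdfs.server.blockmanagement class): replace the bare assert with an explicit null check that throws a descriptive IOException, so the inconsistency is reported even on production clusters where assertions are disabled.
{code:title=Sketch: explicit check instead of assert}
import java.io.IOException;

public class RecoveryCommandSketch {
  // Simplified stand-in for BlockInfo; the under-construction feature is
  // null once the block has been completed (e.g. by close()).
  static class BlockInfo {
    private final Object ucFeature;
    BlockInfo(Object ucFeature) { this.ucFeature = ucFeature; }
    Object getUnderConstructionFeature() { return ucFeature; }
    long getBlockId() { return 42L; }
  }

  // Proposed behavior: fail with a meaningful, catchable exception instead
  // of relying on `assert`, which is disabled by default in production JVMs.
  static void checkUnderConstruction(BlockInfo b) throws IOException {
    if (b.getUnderConstructionFeature() == null) {
      throw new IOException("Block " + b.getBlockId()
          + " has a pending BlockRecoveryCommand but is not under construction");
    }
  }

  public static void main(String[] args) {
    try {
      // Simulates a block completed by close() before the DataNode heartbeat.
      checkUnderConstruction(new BlockInfo(null));
      System.out.println("no exception");
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
{code}
Unlike an AssertionError or NPE, an IOException with a message naming the block can be caught and handled by the heartbeat path, and it documents the invariant for future readers.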
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]