Haoze Wu created HDFS-17157:
-------------------------------
Summary: Transient network failure in lease recovery could leave a
datanode in an inconsistent state for a long time
Key: HDFS-17157
URL: https://issues.apache.org/jira/browse/HDFS-17157
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 2.0.0-alpha
Reporter: Haoze Wu
This case is related to HDFS-12070.
In HDFS-12070, we saw how a faulty drive at a datanode could lead to
permanent block recovery failure and leave the file open indefinitely. With that
patch, instead of failing the whole lease recovery when the second stage of
block recovery fails at one datanode, the lease recovery fails only if the
second stage fails at all of the datanodes.
Below is the code snippet for the second stage of block recovery, in
BlockRecoveryWorker#syncBlock:
{code:java}
    ...
    final List<BlockRecord> successList = new ArrayList<>();
    for (BlockRecord r : participatingList) {
      try {
        r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
            newBlock.getNumBytes());
        successList.add(r);
      } catch (IOException e) {
        ...
{code}
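For context, the error handling after this loop roughly looks like the following
(a paraphrased sketch of the post-HDFS-12070 behavior, not the exact Hadoop
source): the IOException is logged, the datanode is skipped, and the whole
recovery fails only when no datanode succeeded.
{code:java}
      // Paraphrased continuation of the elided catch block above:
      } catch (IOException e) {
        // Log and skip this datanode instead of failing the whole recovery.
        LOG.warn("Failed to updateBlock (newblock=" + newBlock
            + ", datanode=" + r.id + ")", e);
      }
    }

    // The lease recovery fails (and must be retried later) only when the
    // second stage failed on every participating datanode.
    if (successList.isEmpty()) {
      throw new IOException("Cannot recover " + block
          + ": none of the " + participatingList.size()
          + " datanodes succeeded in updateReplicaUnderRecovery");
    }
{code}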
However, because of a transient network failure, the updateReplicaUnderRecovery
RPC issued from the primary datanode to another datanode may return an
EOFException even though the other side never processed the RPC at all, or it
may throw an IOException while reading from the socket:
{code:java}
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
        at org.apache.hadoop.ipc.Client.call(Client.java:1437)
        at org.apache.hadoop.ipc.Client.call(Client.java:1347)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061)
{code}
Then, if the second stage of block recovery succeeds on any other datanode, the
lease recovery as a whole succeeds and the file is closed. However, the last
block was never synced to the failed datanode, and this inconsistency could
persist for a very long time.
To fix the issue, I propose adding a configurable number of retries for the
updateReplicaUnderRecovery RPC so that transient network failures can be
tolerated.
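A minimal sketch of the proposed change in BlockRecoveryWorker#syncBlock (the
retry-count plumbing and the configuration key name below are hypothetical and
only illustrate the idea):
{code:java}
    // Sketch only: maxRetries would come from a new, hypothetical configuration
    // key, e.g. dfs.datanode.block.recovery.update-replica-retries.
    final List<BlockRecord> successList = new ArrayList<>();
    for (BlockRecord r : participatingList) {
      boolean updated = false;
      for (int attempt = 1; attempt <= 1 + maxRetries && !updated; attempt++) {
        try {
          r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
              newBlock.getNumBytes());
          updated = true;
        } catch (IOException e) {
          LOG.warn("updateReplicaUnderRecovery to datanode " + r.id
              + " failed on attempt " + attempt + " of " + (1 + maxRetries), e);
        }
      }
      if (updated) {
        successList.add(r);
      }
      // A datanode that still fails after all retries is skipped, exactly as
      // today; the lease recovery fails only if successList ends up empty.
    }
{code}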