[jira] [Updated] (HDFS-17157) Transient network failure in lease recovery could lead to the block in a datanode in an inconsistent state for a long time

2023-08-11 Thread Haoze Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoze Wu updated HDFS-17157:

Description: 
This case is related to HDFS-12070.

In HDFS-12070, we saw how a faulty drive at a certain datanode could lead to 
permanent block recovery failure and leave the file open indefinitely.  In the 
patch, instead of failing the whole lease recovery process when the second 
stage of block recovery fails at one datanode, the lease recovery process fails 
only if that stage fails at all of the datanodes. 

Below is the code snippet for the second stage of block recovery, in 
BlockRecoveryWorker#syncBlock:
{code:java}
...
final List<BlockRecord> successList = new ArrayList<>();
for (BlockRecord r : participatingList) {
  try {
    // Second stage: ask each participating datanode to commit the recovered
    // replica (new recovery id and length).
    r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
        newBlock.getNumBytes());
    successList.add(r);
  } catch (IOException e) {
...{code}
However, because of a transient network failure, the updateReplicaUnderRecovery 
RPC initiated from the primary datanode to another datanode could return an 
EOFException, while the other side either does not process the RPC at all or 
throws an IOException when reading from the socket. 
{code:java}
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
        at org.apache.hadoop.ipc.Client.call(Client.java:1437)
        at org.apache.hadoop.ipc.Client.call(Client.java:1347)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
        at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061) {code}
Then, if the second stage of block recovery succeeds at any other datanode, the 
lease recovery succeeds and the file is closed. However, the last block was 
never synced to the failed datanode, and this inconsistency could potentially 
last for a very long time. 

To fix the issue, I propose adding a configurable retry of the 
updateReplicaUnderRecovery RPC so that transient network failures can be 
mitigated.
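
As a rough, self-contained sketch only (not the actual patch), the snippet below 
shows what a bounded retry around the per-datanode update could look like. The 
UpdateReplicaCall interface, the updateWithRetry helper, and the main() driver 
are hypothetical names invented for this example; in BlockRecoveryWorker#syncBlock 
the wrapped call would be r.updateReplicaUnderRecovery(...), and the retry count 
and interval would be read from new configuration properties (names to be decided).
{code:java}
import java.io.IOException;

public class UpdateReplicaRetrySketch {

  /** Hypothetical functional wrapper around the per-datanode RPC call. */
  @FunctionalInterface
  interface UpdateReplicaCall {
    void run() throws IOException;
  }

  /**
   * Retry a possibly transiently-failing call up to maxRetries extra times.
   * In a real patch, maxRetries and retryIntervalMs would come from new
   * (hypothetical) block-recovery configuration keys.
   */
  static void updateWithRetry(UpdateReplicaCall call, int maxRetries,
      long retryIntervalMs) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        call.run();
        return;                          // success: stop retrying
      } catch (IOException e) {
        lastFailure = e;                 // may be transient, e.g. EOFException
        if (attempt < maxRetries) {
          try {
            Thread.sleep(retryIntervalMs);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;                       // give up early if interrupted
          }
        }
      }
    }
    // Retries exhausted: rethrow so the existing catch block in syncBlock
    // still leaves this datanode out of the success list.
    throw lastFailure;
  }

  public static void main(String[] args) throws IOException {
    // Toy stand-in for r.updateReplicaUnderRecovery(...): fails twice, then succeeds.
    final int[] attempts = {0};
    updateWithRetry(() -> {
      if (++attempts[0] < 3) {
        throw new IOException("simulated transient connection reset");
      }
      System.out.println("replica updated on attempt " + attempts[0]);
    }, 3, 100L);
  }
}
{code}
Keeping the retry bounded preserves the existing behavior for non-transient 
failures: the IOException still propagates and the datanode is simply excluded 
from the success list, as today.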

Any comments and suggestions would be appreciated.

 

[jira] [Updated] (HDFS-17157) Transient network failure in lease recovery could lead to the block in a datanode in an inconsistent state for a long time

2023-08-11 Thread Haoze Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoze Wu updated HDFS-17157:

Summary: Transient network failure in lease recovery could lead to the 
block in a datanode in an inconsistent state for a long time  (was: Transient 
network failure in lease recovery could lead to a datanode in an inconsistent 
state for a long time)

> Transient network failure in lease recovery could lead to the block in a 
> datanode in an inconsistent state for a long time
> --
>
> Key: HDFS-17157
> URL: https://issues.apache.org/jira/browse/HDFS-17157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Haoze Wu
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)