[
https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888385#comment-16888385
]
Wei-Chiu Chuang edited comment on HDFS-10927 at 7/18/19 10:44 PM:
------------------------------------------------------------------
+1 to what Erik said. HDFS-11472 is the fix for the problem.
I don't have a copy of the code with me, but IIRC replicaInfo#numBytes is updated
when (1) data is flushed to disk *and* (2) the downstream DataNode acknowledges
that it has received the bytes as well. I suspect changing its semantics would
break things -- pipeline recovery is a super complex subject and I'd elect not to
change it unless necessary.
was (Author: jojochuang):
+1 to what Erik said. HDFS-11472 is the fix for the problem.
I don't have a copy of code with me, but IIRC replicaInfo#numBytes is updated
when (1) data is flushed to disk *and* (2) downstream DataNode acknowledges
that it receives the bytes as well.
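For context on the counters that show up in the recovery log below, here is a minimal
standalone sketch (the class and method names are illustrative, not the actual
ReplicaBeingWritten API) of how numBytes, bytesOnDisk and bytesAcked advance at
different points of the write pipeline and can legitimately diverge while a write is
in flight:
{code:java}
// Toy model of the three byte counters reported for an RBW (replica being written).
class ReplicaByteCounters {
    private volatile long numBytes;     // bytes received into the replica (possibly still buffered)
    private volatile long bytesOnDisk;  // bytes flushed/synced to the block file
    private volatile long bytesAcked;   // bytes acknowledged back through the pipeline

    // Called as packets arrive from the upstream writer.
    void onPacketReceived(long newLength) {
        numBytes = Math.max(numBytes, newLength);
    }

    // Called after a packet's data and checksum are flushed to disk.
    void onPacketFlushed(long flushedLength) {
        bytesOnDisk = Math.max(bytesOnDisk, flushedLength);
    }

    // Called when the downstream DataNode (or the last node itself) acks the packet.
    void onPacketAcked(long ackedLength) {
        bytesAcked = Math.max(bytesAcked, ackedLength);
    }

    // Invariant that block recovery expects to hold: everything acknowledged
    // (the visible length) must already be on disk.
    boolean recoveryInvariantHolds() {
        return bytesOnDisk >= bytesAcked;
    }
}
{code}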
> Lease Recovery: File not getting closed on HDFS when block write operation fails
> --------------------------------------------------------------------------------
>
> Key: HDFS-10927
> URL: https://issues.apache.org/jira/browse/HDFS-10927
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.7.1
> Reporter: Nitin Goswami
> Priority: Major
>
> HDFS was unable to close a file after a block write operation failed because
> disk usage was too high.
> Scenario:
> HBase was writing WAL logs to HDFS while disk usage was very high. During
> these writes, one of the block write operations failed with the following
> exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
>         at java.lang.Thread.run(Unknown Source)
> After this exception, HBase tried to close/roll over the WAL file, but that
> call also failed and the WAL file could not be closed. HBase then shut down
> the region server.
> Some time later, lease recovery was triggered for this file and the following
> exceptions started occurring:
> 2016-09-13 11:51:11,743 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes()     = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()       = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()    = /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
>   bytesAcked=45511557
>   bytesOnDisk=45483527
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
>         at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
>         at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
>         at java.lang.Thread.run(Unknown Source)
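> For reference, a simplified, self-contained rendering of the consistency check that
> produces the message above; the interface and class names below are stand-ins for
> illustration, not the actual FsDatasetImpl code:
> {code:java}
> import java.io.IOException;
>
> // Stand-in for the replica view that block recovery inspects.
> interface RecoverableReplica {
>     long getBytesOnDisk();      // bytes durably written to the block file
>     long getVisibleLength();    // bytes already acknowledged to the client
> }
>
> final class ReplicaRecoveryCheck {
>     // Recovery refuses to proceed when fewer bytes are on disk than the client
>     // has been told are visible; that is the situation reported in the log above.
>     static void checkReplicaForRecovery(RecoverableReplica replica) throws IOException {
>         if (replica.getBytesOnDisk() < replica.getVisibleLength()) {
>             // With bytesOnDisk=45483527 and visibleLength=45511557 (as in the log),
>             // this branch is hit on every retry, so lease recovery never completes.
>             throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength()"
>                 + ", bytesOnDisk=" + replica.getBytesOnDisk()
>                 + ", visibleLength=" + replica.getVisibleLength());
>         }
>     }
> }
> {code}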
> Expected Behaviour: Under all conditions, lease recovery should have completed
> and the file should have been closed.
> Impact: Because the file could not be closed, HBase went into an inconsistent
> state, as it was unable to replay the WAL file after the region server
> restart.
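> As a hedged illustration of the expected recovery path, the sketch below retries
> DistributedFileSystem#recoverLease until the NameNode reports the file closed. The
> path argument and timeout are illustrative; in this bug such a loop never succeeds,
> because the DataNode-side check shown above keeps throwing:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class WalLeaseRecovery {
>     // Repeatedly ask the NameNode to recover the lease until the file is closed
>     // or the timeout expires.
>     public static boolean recoverAndWaitForClose(DistributedFileSystem dfs, Path wal,
>                                                  long timeoutMs) throws Exception {
>         long deadline = System.currentTimeMillis() + timeoutMs;
>         while (System.currentTimeMillis() < deadline) {
>             // recoverLease returns true once the file is closed (or was already closed).
>             if (dfs.recoverLease(wal)) {
>                 return true;
>             }
>             Thread.sleep(1000L); // give block recovery time to finish before retrying
>         }
>         return false;
>     }
>
>     public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         if (fs instanceof DistributedFileSystem) {
>             Path wal = new Path(args[0]); // e.g. the stuck WAL file
>             boolean closed = recoverAndWaitForClose((DistributedFileSystem) fs, wal, 60_000L);
>             System.out.println("file closed: " + closed);
>         }
>     }
> }
> {code}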