[
https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888385#comment-16888385
]
Wei-Chiu Chuang edited comment on HDFS-10927 at 7/18/19 10:44 PM:
------------------------------------------------------------------
+1 to what Erik said. HDFS-11472 is the fix for the problem.
I don't have a copy of the code with me, but IIRC replicaInfo#numBytes is updated
when (1) data is flushed to disk *and* (2) the downstream DataNode acknowledges
that it has received the bytes as well. I suspect changing its semantics would
break things -- pipeline recovery is a super complex subject and I'd elect not to
change it unless necessary.
was (Author: jojochuang):
+1 to what Erik said. HDFS-11472 is the fix for the problem.
I don't have a copy of code with me, but IIRC replicaInfo#numBytes is updated
when (1) data is flushed to disk *and* (2) downstream DataNode acknowledges
that it receives the bytes as well.
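For context on the counters that show up in the recovery log below, here is a minimal
standalone sketch (the class and method names are illustrative, not the actual
ReplicaBeingWritten API) of how numBytes, bytesOnDisk and bytesAcked advance at
different points of the write pipeline and can legitimately diverge while a write is
in flight:
{code:java}
// Toy model of the three byte counters reported for an RBW (replica being written).
class ReplicaByteCounters {
    private volatile long numBytes;     // bytes received into the replica (possibly still buffered)
    private volatile long bytesOnDisk;  // bytes flushed/synced to the block file
    private volatile long bytesAcked;   // bytes acknowledged back through the pipeline

    // Called as packets arrive from the upstream writer.
    void onPacketReceived(long newLength) {
        numBytes = Math.max(numBytes, newLength);
    }

    // Called after a packet's data and checksum are flushed to disk.
    void onPacketFlushed(long flushedLength) {
        bytesOnDisk = Math.max(bytesOnDisk, flushedLength);
    }

    // Called when the downstream DataNode (or the last node itself) acks the packet.
    void onPacketAcked(long ackedLength) {
        bytesAcked = Math.max(bytesAcked, ackedLength);
    }

    // Invariant that block recovery expects to hold: everything acknowledged
    // (the visible length) must already be on disk.
    boolean recoveryInvariantHolds() {
        return bytesOnDisk >= bytesAcked;
    }
}
{code}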
> Lease Recovery: File not getting closed on HDFS when block write operation fails
> --------------------------------------------------------------------------------
>
> Key: HDFS-10927
> URL: https://issues.apache.org/jira/browse/HDFS-10927
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.7.1
> Reporter: Nitin Goswami
> Priority: Major
>
> HDFS was unable to close a file after a block write operation failed because
> disk usage was too high.
> Scenario:
> HBase was writing WAL logs to HDFS while disk usage was very high. During
> these writes, one of the block write operations failed with the following
> exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
>         at java.lang.Thread.run(Unknown Source)
> After this exception, HBase tried to close/roll over the WAL file, but that
> call also failed and the WAL file could not be closed. HBase then shut down
> the region server.
> Some time later, lease recovery was triggered for this file and the following
> exceptions started occurring:
> 2016-09-13 11:51:11,743 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes()     = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()       = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()    = /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
>   bytesAcked=45511557
>   bytesOnDisk=45483527
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
>         at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
>         at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
>         at java.lang.Thread.run(Unknown Source)
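> For reference, a simplified, self-contained rendering of the consistency check that
> produces the message above; the interface and class names below are stand-ins for
> illustration, not the actual FsDatasetImpl code:
> {code:java}
> import java.io.IOException;
>
> // Stand-in for the replica view that block recovery inspects.
> interface RecoverableReplica {
>     long getBytesOnDisk();      // bytes durably written to the block file
>     long getVisibleLength();    // bytes already acknowledged to the client
> }
>
> final class ReplicaRecoveryCheck {
>     // Recovery refuses to proceed when fewer bytes are on disk than the client
>     // has been told are visible; that is the situation reported in the log above.
>     static void checkReplicaForRecovery(RecoverableReplica replica) throws IOException {
>         if (replica.getBytesOnDisk() < replica.getVisibleLength()) {
>             // With bytesOnDisk=45483527 and visibleLength=45511557 (as in the log),
>             // this branch is hit on every retry, so lease recovery never completes.
>             throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength()"
>                 + ", bytesOnDisk=" + replica.getBytesOnDisk()
>                 + ", visibleLength=" + replica.getVisibleLength());
>         }
>     }
> }
> {code}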
> Expected Behaviour: Under all conditions, lease recovery should have completed
> and the file should have been closed.
> Impact: Because the file could not be closed, HBase went into an inconsistent
> state, as it was unable to replay the WAL file after the region server
> restart.
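> As a hedged illustration of the expected recovery path, the sketch below retries
> DistributedFileSystem#recoverLease until the NameNode reports the file closed. The
> path argument and timeout are illustrative; in this bug such a loop never succeeds,
> because the DataNode-side check shown above keeps throwing:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class WalLeaseRecovery {
>     // Repeatedly ask the NameNode to recover the lease until the file is closed
>     // or the timeout expires.
>     public static boolean recoverAndWaitForClose(DistributedFileSystem dfs, Path wal,
>                                                  long timeoutMs) throws Exception {
>         long deadline = System.currentTimeMillis() + timeoutMs;
>         while (System.currentTimeMillis() < deadline) {
>             // recoverLease returns true once the file is closed (or was already closed).
>             if (dfs.recoverLease(wal)) {
>                 return true;
>             }
>             Thread.sleep(1000L); // give block recovery time to finish before retrying
>         }
>         return false;
>     }
>
>     public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         if (fs instanceof DistributedFileSystem) {
>             Path wal = new Path(args[0]); // e.g. the stuck WAL file
>             boolean closed = recoverAndWaitForClose((DistributedFileSystem) fs, wal, 60_000L);
>             System.out.println("file closed: " + closed);
>         }
>     }
> }
> {code}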