[ https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887883#comment-16887883 ]
Chen Zhang commented on HDFS-10927:
-----------------------------------

Sorry, I can't remember the details from back then; that comment is too old.
{quote}HBase RegionServer got a DiskFull exception while writing WAL files, and the client failed after several retries{quote}
I mentioned that the client retried several times before it failed, so maybe the cluster was almost full at that time? But I really can't remember.


> Lease Recovery: File not getting closed on HDFS when block write operation fails
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-10927
>                 URL: https://issues.apache.org/jira/browse/HDFS-10927
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.1
>            Reporter: Nitin Goswami
>            Priority: Major
>
> HDFS was unable to close a file when a block write operation failed because
> disk usage was too high.
>
> Scenario:
> HBase was writing WAL logs on HDFS while disk usage was very high. While
> writing these WAL logs, one of the block write operations failed with the
> following exception:
>
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
>         at java.lang.Thread.run(Unknown Source)
>
> After this exception, HBase tried to close/roll the WAL file, but that call
> also failed and the WAL file could not be closed.
> After this, HBase shut down the region server.
>
> After some time, lease recovery was triggered for this file and the following
> exception started occurring:
>
> 2016-09-13 11:51:11,743 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes()     = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()       = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()    = /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
>   bytesAcked=45511557
>   bytesOnDisk=45483527
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
>         at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
>         at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
>         at java.lang.Thread.run(Unknown Source)
>
> Expected behaviour: Lease recovery should complete under all conditions and
> the file should be closed.
>
> Impact: Since the file could not be closed, HBase went into an inconsistent
> state, as it was unable to replay the WAL file after the region server
> restart.
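Two illustrative sketches for readers of this report; neither is a patch.

The "THIS IS NOT SUPPOSED TO HAPPEN" message above is thrown by a replica consistency check in FsDatasetImpl#initReplicaRecovery: for a replica being written (RBW), the length already acknowledged to the writer (the visible length) should never exceed the bytes actually persisted to disk. A simplified, non-verbatim sketch of that check (the class and helper name here are hypothetical):

{code:java}
import java.io.IOException;
import org.apache.hadoop.hdfs.server.datanode.Replica;

final class ReplicaRecoveryCheck {
  // Simplified, non-verbatim sketch of the check that throws in
  // FsDatasetImpl#initReplicaRecovery (see the stack trace above).
  static void checkReplicaConsistency(Replica rip) throws IOException {
    // Bytes acknowledged to the writer must already be on disk. In the
    // report above, the disk-full write path left bytesOnDisk behind
    // bytesAcked, so recovery of this replica never gets past this check.
    if (rip.getBytesOnDisk() < rip.getVisibleLength()) {
      throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN:"
          + " getBytesOnDisk() < getVisibleLength(), rip=" + rip);
    }
  }
}
{code}

In the log above, bytesOnDisk=45483527 lags bytesAcked=45511557 by 28030 bytes, consistent with a partial flush under disk-full pressure; every recovery attempt then fails this same check, which is why the file is never closed.

For completeness, a minimal sketch of how a client can ask the NameNode to recover the lease on a file left open. DistributedFileSystem#recoverLease is the real API; the wrapper class and parameter names are hypothetical, and in the scenario reported above even repeated calls would keep failing until the bad replica is dealt with:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public final class LeaseRecoveryUtil {
  private LeaseRecoveryUtil() {}

  /** Polls recoverLease() until the file is closed or attempts run out. */
  public static boolean recoverLease(DistributedFileSystem dfs, Path file,
      int maxAttempts, long waitMs) throws IOException, InterruptedException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      // recoverLease() returns immediately; true means the NameNode has
      // closed the file. Block recovery itself runs asynchronously, so poll.
      if (dfs.recoverLease(file)) {
        return true;
      }
      Thread.sleep(waitMs);
    }
    return false;
  }
}
{code}

The shell equivalent on 2.7+ releases is {{hdfs debug recoverLease -path <file> [-retries <num-retries>]}}.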