Nitin Goswami created HDFS-10927:
------------------------------------

             Summary: Lease Recovery: File not getting closed on HDFS when block write operation fails
                 Key: HDFS-10927
                 URL: https://issues.apache.org/jira/browse/HDFS-10927
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: fs
    Affects Versions: 2.7.1
            Reporter: Nitin Goswami


HDFS was unable to close a file when a block write operation failed because of 
excessively high disk usage.

Scenario:

HBase was writing WAL logs to HDFS while disk usage was very high. While 
writing these WAL logs, one of the block write operations failed with the 
following exception:

2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read1(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at java.io.DataInputStream.read(Unknown Source)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
        at java.lang.Thread.run(Unknown Source)

After this exception, HBase tried to close/roll over the WAL file, but that 
call also failed and the WAL file could not be closed. HBase then shut down 
the region server.
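
For illustration, a minimal sketch of the client-side pattern (hypothetical 
path and payload; this is not HBase's actual WAL code): once the write 
pipeline has failed, the pending error resurfaces on close(), so the file 
stays open under the dead writer's lease.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalCloseSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // hypothetical WAL path, for illustration only
        FSDataOutputStream out = fs.create(new Path("/hbase/WALs/example.log"));
        out.write("wal entry".getBytes(StandardCharsets.UTF_8));
        try {
            out.close(); // if the pipeline already failed, close() rethrows that failure
        } catch (IOException e) {
            // the file is left open; its lease is still held by this (now dead)
            // writer, and only lease recovery on the NameNode can release it
        }
    }
}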

Some time later, Lease Recovery was triggered for this file, and the following 
exceptions started occurring:

2016-09-13 11:51:11,743 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
  getNumBytes()     = 45524696
  getBytesOnDisk()  = 45483527
  getVisibleLength()= 45511557
  getVolume()       = /opt/reflex/data/yarn/datanode/current
  getBlockFile()    = /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
  bytesAcked=45511557
  bytesOnDisk=45483527
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
        at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
        at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
        at java.lang.Thread.run(Unknown Source)
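
The "THIS IS NOT SUPPOSED TO HAPPEN" message comes from a sanity check during 
replica recovery: the DataNode refuses to recover a replica whose on-disk 
length is shorter than the length already acknowledged to the client. A 
minimal sketch of that invariant follows, paraphrased from 
FsDatasetImpl.initReplicaRecovery(); the method shape here is illustrative, 
not the exact Hadoop source.

import java.io.IOException;

public class ReplicaRecoveryCheckSketch {
    // Paraphrase of the sanity check in FsDatasetImpl.initReplicaRecovery();
    // illustrative only, not the exact Hadoop source.
    static void checkReplica(long bytesOnDisk, long visibleLength) throws IOException {
        // visibleLength is the number of bytes already acknowledged to the
        // client (bytesAcked); it must never exceed what is actually on disk
        if (bytesOnDisk < visibleLength) {
            throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN:"
                    + " getBytesOnDisk() < getVisibleLength()");
        }
    }

    public static void main(String[] args) throws IOException {
        // the values from the log above: 45483527 bytes on disk vs 45511557
        // acked, so recovery of this replica aborts instead of completing
        checkReplica(45483527L, 45511557L);
    }
}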

Expected Behaviour: Lease recovery should succeed under all conditions, and 
the file should be closed.
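
For reference, a sketch of how lease recovery can be triggered from a client. 
recoverLease() and isFileClosed() are the real DistributedFileSystem APIs in 
2.7.x, but the path and the retry policy here are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ForceLeaseRecovery {
    public static void main(String[] args) throws Exception {
        // hypothetical path of the stuck WAL file
        Path wal = new Path("/hbase/WALs/stuck-wal.log");
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
        // recoverLease() returns true if the file is already closed or could
        // be closed immediately; otherwise recovery continues in the background
        boolean closed = dfs.recoverLease(wal);
        while (!closed) {
            Thread.sleep(4000);               // assumed retry interval
            closed = dfs.isFileClosed(wal);   // poll instead of re-triggering recovery
        }
    }
}

In the scenario reported here, this loop would never terminate, because the 
DataNode-side check shown above keeps aborting the recovery of the RBW replica.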

Impact: Since the file could not be closed, HBase went into an inconsistent 
state because it was unable to replay the WAL file after the region server 
restarted.


