[ 
https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871310#comment-15871310
 ] 

Chen Zhang edited comment on HDFS-10927 at 2/17/17 7:25 AM:
------------------------------------------------------------

Hey guys, I just ran into an issue like this.
An HBase RegionServer got a DiskFull exception while writing WAL files, and the 
client failed after several retries. When the Master tried to use recoverLease 
to recover these files, we got almost the same logs as [~ngoswami] attached:
{quote}
java.io.IOException: File length mismatched.  The length of 
/home/work/ssd11/hdfs/xxxx/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW
  getNumBytes()     = 174437376
  getBytesOnDisk()  = 174429752
  getVisibleLength()= 174429752
  getVolume()       = /home/work/ssd11/hdfs/xxxxx/datanode/current
  getBlockFile()    = 
/home/work/ssd11/hdfs/xxxxx/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
  bytesAcked=174429752
  bytesOnDisk=174429752
{quote}

In my case, it's caused by an exception thrown from out.write() in receivePacket() 
of BlockReceiver. 
receivePacket() first updates numBytes in the replicaInfo, then writes the data to 
disk, and only updates bytesOnDisk at the end, so a DiskFull exception in the 
middle leaves the lengths inconsistent. A rough sketch of that ordering is below.
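
To make the ordering concrete, here is a simplified, hypothetical sketch. This is 
not the real BlockReceiver code; the field names just mirror the 
getNumBytes()/getBytesOnDisk() values printed in the log above.
{code:java}
import java.io.IOException;
import java.io.OutputStream;

// Placeholder replica holding only the two counters that diverge in the log.
class ReplicaSketch {
  long numBytes;     // what getNumBytes() reports
  long bytesOnDisk;  // what getBytesOnDisk()/bytesAcked report
}

class ReceivePacketSketch {
  // Illustrates the update ordering described above, not the actual HDFS code.
  static void receivePacket(ReplicaSketch replica, OutputStream blockOut,
                            byte[] packetData, long offsetInBlock) throws IOException {
    long endOfPacket = offsetInBlock + packetData.length;

    // 1. The in-memory length is advanced to the end of the packet first ...
    replica.numBytes = endOfPacket;   // e.g. jumps ahead to 174437376

    // 2. ... then the packet is written to the block file. A DiskFull error
    //    makes this throw before all of the bytes reach the disk.
    blockOut.write(packetData);

    // 3. bytesOnDisk is only advanced after a successful write, so on the
    //    DiskFull path this line is never reached and bytesOnDisk stays at
    //    the old value (174429752 in the log), leaving numBytes > bytesOnDisk
    //    and neither matching the on-disk file length.
    replica.bytesOnDisk = endOfPacket;
  }
}
{code}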



> Lease Recovery: File not getting closed on HDFS when block write operation 
> fails
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-10927
>                 URL: https://issues.apache.org/jira/browse/HDFS-10927
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.1
>            Reporter: Nitin Goswami
>
> HDFS was unable to close a file when a block write operation failed because 
> disk usage was too high.
> Scenario:
> HBase was writing WAL logs on HDFS and the disk usage was too high at that 
> time. While writing these WAL logs, one of the block write operations failed 
> with the following exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
>         at java.lang.Thread.run(Unknown Source)
> After this exception, HBase tried to close/roll over the WAL file, but that 
> call also failed and the WAL file couldn't be closed. After this, HBase shut 
> down the region server.
> After some time, lease recovery was triggered for this file and the following 
> exceptions started occurring:
> 2016-09-13 11:51:11,743 WARN 
> org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to 
> obtain replica info for block 
> (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from 
> datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < 
> getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes()     = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()       = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()    = 
> /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
>   bytesAcked=45511557
>   bytesOnDisk=45483527
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
>         at 
> org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown 
> Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
>         at java.lang.Thread.run(Unknown Source)
> Expected Behaviour: Under all conditions, lease recovery should have completed 
> and the file should have been closed.
> Impact: Since the file couldn't be closed, HBase went into an inconsistent 
> state because it wasn't able to replay the WAL file after the region server 
> restart.
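
For context on why the recovery keeps failing: the exception above comes from a 
sanity check during replica recovery that rejects replicas whose on-disk length 
is behind the length already acknowledged to the client. A minimal sketch of 
that check, inferred only from the exception text above (not the actual 
FsDatasetImpl.initReplicaRecovery code), could look roughly like this:
{code:java}
import java.io.IOException;

// Hypothetical illustration of the consistency check that the stack trace
// above trips over; names and structure are assumptions, not HDFS internals.
class ReplicaRecoveryCheckSketch {
  static void checkRecoverable(long bytesOnDisk, long visibleLength) throws IOException {
    // A replica that acknowledged more bytes to the client (visible length)
    // than it actually has on disk cannot be recovered safely, so the
    // DataNode aborts with the error seen in the log.
    if (bytesOnDisk < visibleLength) {
      throw new IOException(
          "THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength()");
    }
  }

  public static void main(String[] args) throws IOException {
    // Values from the log: bytesOnDisk=45483527, visibleLength=45511557 -> throws.
    checkRecoverable(45483527L, 45511557L);
  }
}
{code}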


