[ 
https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888860#comment-16888860
 ] 

Chen Zhang edited comment on HDFS-10927 at 7/19/19 1:27 PM:
------------------------------------------------------------

Thanks [~hexiaoqiao] for finding that HDFS-11472 can fix this issue.

Regarding your proposal:
{quote}What I means above is that we could update replicaInfo#numBytes after 
send to downstream and flush to disk
{quote}
I think the current behavior is by design. In the source code, numBytes is 
actually bytesReceived, and visibleLength is actually bytesAcked. The attached 
[^appendDesign3.pdf] explains how these two fields are updated (a simplified 
sketch follows the list):
 # Every DN updates bytesReceived when it receives a packet, and then forwards 
the packet to the next peer.
 # The DN then writes the data to disk.
 # After receiving the ack from its downstream peer, the DN updates bytesAcked.
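To make the ordering concrete, here is a minimal Java sketch of that flow. This 
is not the real BlockReceiver code; the class and method names are invented for 
illustration, and it only shows which counter moves at which step.
{code:java}
// Simplified, hypothetical sketch of the ordering described in appendDesign3.pdf.
// bytesReceived stands in for replicaInfo#numBytes and bytesAcked for the
// visible length; everything else here is made up for clarity.
class PipelineSketch {
    private volatile long bytesReceived; // numBytes: advanced on packet arrival
    private volatile long bytesAcked;    // visibleLength: advanced on downstream ack

    // Steps 1 and 2: record the packet as received, mirror it, then persist it.
    void receivePacket(byte[] data, long offsetInBlock) throws java.io.IOException {
        // 1. bytesReceived is bumped the moment the bytes arrive on this DN ...
        bytesReceived = Math.max(bytesReceived, offsetInBlock + data.length);
        // ... and the packet is forwarded to the next DN in the pipeline,
        forwardToDownstream(data, offsetInBlock);
        // 2. then written to the local block file.
        writeToDisk(data, offsetInBlock);
    }

    // Step 3: only the ack path moves bytesAcked (the visible length).
    void onAckFromDownstream(long ackedOffset) {
        bytesAcked = Math.max(bytesAcked, ackedOffset);
    }

    private void forwardToDownstream(byte[] data, long offset) { /* mirror write, elided */ }

    private void writeToDisk(byte[] data, long offset) throws java.io.IOException { /* block file write, elided */ }
}
{code}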

The design doc says: "This algorithm provides a trade-off on the write 
performance, data durability, and algorithm simplicity", because it 
"Parallelizes data/ack streaming in downstream pipeline and on-disk writing".

If we update bytesReceived only after sending to the downstream and flushing to 
disk, we may *break the semantics of the field "bytesReceived"*. For example, if 
sending a packet to the downstream succeeds but flushing to disk fails, 
bytesReceived is not updated even though this DN has actually received the 
packet.
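Continuing the hypothetical sketch above, the proposed ordering would look 
roughly like the variant below: if the flush throws, bytesReceived never 
advances even though the packet was in fact received and already forwarded 
downstream.
{code:java}
// Hypothetical variant of PipelineSketch#receivePacket implementing the proposal:
// update bytesReceived only after forwarding and flushing.
void receivePacketProposed(byte[] data, long offsetInBlock) throws java.io.IOException {
    forwardToDownstream(data, offsetInBlock);   // the downstream already has the packet
    writeToDisk(data, offsetInBlock);           // if this throws ...
    // ... the next line is never reached, so bytesReceived no longer reflects
    // what this DN has actually received.
    bytesReceived = Math.max(bytesReceived, offsetInBlock + data.length);
}
{code}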

I'm not saying the proposal is infeasible; I simply agree with [~jojochuang]:
{quote}I suspect changing its semantics would break stuff – pipeline recovery 
is a super complex subject and I'd elect to not change it if necessary.
{quote}



> Lease Recovery: File not getting closed on HDFS when block write operation 
> fails
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-10927
>                 URL: https://issues.apache.org/jira/browse/HDFS-10927
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.1
>            Reporter: Nitin Goswami
>            Priority: Major
>
> HDFS was unable to close a file when a block write operation failed because of 
> too-high disk usage.
> Scenario:
> HBase was writing WAL logs on HDFS and the disk usage was very high at that 
> time. While writing these WAL logs, one of the block write operations failed 
> with the following exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
>         at java.lang.Thread.run(Unknown Source)
> After this exception, HBase tried to close/roll over the WAL file, but that 
> call also failed and the WAL file couldn't be closed. After this, HBase closed 
> the region server.
> After some time, lease recovery was triggered for this file and the following 
> exceptions started occurring:
> 2016-09-13 11:51:11,743 WARN 
> org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to 
> obtain replica info for block 
> (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from 
> datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < 
> getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes()     = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()       = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()    = 
> /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
>   bytesAcked=45511557
>   bytesOnDisk=45483527
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
>         at 
> org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown 
> Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
>         at java.lang.Thread.run(Unknown Source)
> Expected behaviour: under all conditions, lease recovery should have completed 
> and the file should have been closed.
> Impact: since the file couldn't be closed, HBase went into an inconsistent 
> state, as it wasn't able to run through the WAL file after the region server 
> restart.


