[ 
https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Summary: DFSOutputStream.java#closeImpl should have configurable retries 
upon flushInternal failures  (was: DFSOutputStream.java#closeImpl should 
configurable retries upon flushInternal failures)

> DFSOutputStream.java#closeImpl should have configurable retries upon 
> flushInternal failures
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the 
> waitForAckedSeqno call when its timeout is exceeded, which throws an 
> InterruptedIOException. This method is used in 
> [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773],
> one of whose use cases is 
> [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870]
> to close a file.
> What we saw was that we were getting an InterruptedIOException from the 
> flushInternal call while closing out a file, which was unhandled by the 
> DFSClient and got thrown to the caller. There is a known issue, HDFS-4504: when 
> a file fails to close on the HDFS side, block recovery is not called and the 
> lease is leaked until the DFSClient gets recycled. In our HBase setups, 
> DFSClients remain long-lived in regionservers, which means these files stay 
> open until the corresponding regionservers get restarted.
> This issue was observed during datanode decommission, which got stuck on 
> open files caused by the above leakage. Since it is desirable to close an HDFS 
> file as smoothly as possible, retrying flushInternal during closeImpl 
> would help reduce such leakages. The number of retries can be 
> based on a dfsclient config. [For 
> example|https://github.com/apache/hadoop/blob/63633812a417a3d548be3bdcecebd2ae893d03e0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1660]
>  
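A minimal sketch of what the proposed configurable retry could look like. This is not actual Hadoop code: the class, the constructor parameters, and the failure simulation are all hypothetical, standing in for DFSOutputStream#closeImpl, DFSOutputStream#flushInternal, and a new dfsclient config key.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/**
 * Hypothetical sketch of the retry loop proposed for
 * DFSOutputStream#closeImpl. The real flushInternal() may throw
 * InterruptedIOException when waitForAckedSeqno times out (HDFS-15865);
 * here a stand-in throws a fixed number of times to simulate that.
 */
public class CloseRetrySketch {
  // Would come from a new dfsclient config key (name TBD); illustrative only.
  private final int maxFlushRetries;
  // Test knob: how many flushInternal() calls fail before one succeeds.
  private final int failuresBeforeSuccess;
  private int attemptsUsed = 0;

  public CloseRetrySketch(int maxFlushRetries, int failuresBeforeSuccess) {
    this.maxFlushRetries = maxFlushRetries;
    this.failuresBeforeSuccess = failuresBeforeSuccess;
  }

  // Stand-in for DFSOutputStream#flushInternal.
  void flushInternal() throws IOException {
    if (attemptsUsed++ < failuresBeforeSuccess) {
      throw new InterruptedIOException("wait for acked seqno timed out");
    }
  }

  // Proposed closeImpl behavior: retry flushInternal up to the configured
  // number of times instead of surfacing the first failure to the caller,
  // so the file has a better chance of closing and releasing its lease.
  public boolean closeWithRetries() throws IOException {
    IOException last = null;
    for (int i = 0; i <= maxFlushRetries; i++) {
      try {
        flushInternal();
        return true;           // flush succeeded; closing can proceed
      } catch (InterruptedIOException e) {
        last = e;              // remember the failure and retry
      }
    }
    throw last;                // retries exhausted; caller still sees it
  }
}
```

With two allowed retries and two simulated timeouts, the third attempt succeeds and the close completes; with fewer retries than failures, the last InterruptedIOException is rethrown, preserving today's caller-visible behavior as the fallback.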



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
