[
https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zinan Zhuang updated HDFS-17553:
--------------------------------
Summary: DFSOutputStream.java#closeImpl should have configurable retries
upon flushInternal failures (was: DFSOutputStream.java#closeImpl should
configurable retries upon flushInternal failures)
> DFSOutputStream.java#closeImpl should have configurable retries upon
> flushInternal failures
> -------------------------------------------------------------------------------------------
>
> Key: HDFS-17553
> URL: https://issues.apache.org/jira/browse/HDFS-17553
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: dfsclient
> Affects Versions: 3.3.1, 3.4.0
> Reporter: Zinan Zhuang
> Priority: Major
>
> HDFS-15865 introduced an interrupt in the DataStreamer class to abort the
> waitForAckedSeqno call once its timeout is exceeded, which throws an
> InterruptedIOException. This method is used in
> [DFSOutputStream.java#flushInternal
> |https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773]
> , one of whose callers is
> [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870]
> , which closes a file.
> What we saw was that we were getting InterruptedIOExceptions from the
> flushInternal call while closing out a file; the exception was unhandled by
> DFSClient and propagated to the caller. There is a known issue, HDFS-4504:
> when a file fails to close on the HDFS side, block recovery is not triggered
> and the lease is leaked until the DFSClient is recycled. In our HBase setups,
> DFSClients are long-lived in regionservers, which means these files remain
> undead until the corresponding regionservers are restarted.
> This issue was observed during datanode decommission, which got stuck on
> files left open by the leakage above. Since it is desirable to close an HDFS
> file as smoothly as possible, retrying flushInternal during closeImpl
> operations would help reduce such leakages. The number of retries could be
> driven by a DFSClient config. [For
> example|https://github.com/apache/hadoop/blob/63633812a417a3d548be3bdcecebd2ae893d03e0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1660]
>
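The proposed retry could look roughly like the sketch below: a bounded loop around flushInternal that swallows InterruptedIOException until the configured retry budget is spent. The class, method, and field names here (FlushRetrySketch, flushWithRetries, the simulated failure counter) are illustrative assumptions, not existing HDFS code; the real change would live inside DFSOutputStream#closeImpl and read the limit from a dfsclient config key.

```java
import java.io.InterruptedIOException;

// Hypothetical sketch of the retry proposed in this issue; names are
// illustrative, not taken from the Hadoop codebase.
public class FlushRetrySketch {
    private int remainingFailures; // simulate N flush failures before success
    int attempts = 0;

    FlushRetrySketch(int failures) {
        this.remainingFailures = failures;
    }

    // Stand-in for DFSOutputStream#flushInternal, which can throw
    // InterruptedIOException when the ack wait times out (HDFS-15865).
    void flushInternal() throws InterruptedIOException {
        attempts++;
        if (remainingFailures-- > 0) {
            throw new InterruptedIOException("ack wait timed out");
        }
    }

    // Bounded retry loop as the issue suggests; maxRetries would come from
    // a DFSClient configuration value.
    void flushWithRetries(int maxRetries) throws InterruptedIOException {
        int attempt = 0;
        while (true) {
            try {
                flushInternal();
                return; // flush succeeded; closeImpl can proceed
            } catch (InterruptedIOException e) {
                if (++attempt > maxRetries) {
                    throw e; // retries exhausted; surface the failure
                }
            }
        }
    }
}
```

With maxRetries large enough to outlast a transient timeout, the file closes cleanly and no lease is leaked; once the budget is exhausted, the original exception still reaches the caller, preserving today's behavior as the fallback.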
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]