[
https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zinan Zhuang updated HDFS-17553:
--------------------------------
Description:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an
interrupt in the DataStreamer class that aborts the waitForAckedSeqno call
when the timeout has been exceeded, throwing an InterruptedIOException. This
method is used by
[DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773],
one of whose callers is
[DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870],
which closes a file.
What we saw was InterruptedIOExceptions thrown during the flushInternal call
while we were closing out a file; these went unhandled by DFSClient and
propagated to the caller. There is a known issue,
[HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504], where a file that
fails to close on the HDFS side leaks its lease until the DFSClient is
recycled. In our HBase setups, DFSClients remain long-lived in regionservers,
which means these files stay undead until the corresponding regionservers are
restarted.
We observed this issue during datanode decommissioning, which got stuck on
open files caused by the above leakage. Since it is desirable to close an HDFS
file as smoothly as possible, retrying flushInternal during closeImpl
operations would help reduce such leakages.
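The proposed retry could be sketched as follows. This is a hypothetical illustration, not the actual Hadoop patch: the Flusher interface and flushWithRetry helper are stand-ins invented here for DFSOutputStream#flushInternal and the retry logic closeImpl would gain, and the retry count is an assumed tunable.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical sketch: retry flushInternal a bounded number of times when it
// is interrupted during close, instead of surfacing the InterruptedIOException
// to the caller and leaking the lease.
public class CloseRetrySketch {
    interface Flusher {
        // Stand-in for DFSOutputStream#flushInternal.
        void flushInternal() throws IOException;
    }

    // Retry up to maxRetries extra times on InterruptedIOException;
    // any other IOException propagates immediately.
    static void flushWithRetry(Flusher f, int maxRetries) throws IOException {
        InterruptedIOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                f.flushInternal();
                return; // all packets acked, safe to proceed with close
            } catch (InterruptedIOException e) {
                last = e; // timeout-driven interrupt from the streamer; try again
            }
        }
        throw last; // retries exhausted, let the caller see the failure
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated flush that is interrupted twice before succeeding.
        Flusher flaky = () -> {
            if (++calls[0] < 3) throw new InterruptedIOException("ack timeout");
        };
        flushWithRetry(flaky, 3);
        System.out.println("flush succeeded after " + calls[0] + " attempts");
    }
}
```

A bounded retry keeps close from hanging forever while still giving transient ack timeouts a chance to resolve before the lease is abandoned.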
was:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an
interrupt in the DataStreamer class that aborts the waitForAckedSeqno call
when the timeout has been exceeded, throwing an InterruptedIOException. This
method is used by
[DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773],
one of whose callers is
[DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870],
which closes a file.
What we saw was InterruptedIOExceptions thrown during the flushInternal call
while we are closing out a file; these went unhandled by DFSClient and
propagated to the caller. There is a known issue,
[HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504], where a file that
fails to close on the HDFS side leaks its lease until the DFSClient is
recycled. In our HBase setups, DFSClients remain long-lived in each
regionserver, which means these files stay undead until the regionserver is
restarted.
We observed this issue during datanode decommissioning, which got stuck on
open files caused by the above leakage. Since it is desirable to close an HDFS
file as smoothly as possible, retrying flushInternal during closeImpl
operations would help reduce such leakages.
> DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
> ------------------------------------------------------------------------------
>
> Key: HDFS-17553
> URL: https://issues.apache.org/jira/browse/HDFS-17553
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: dfsclient
> Affects Versions: 3.3.1, 3.4.0
> Reporter: Zinan Zhuang
> Priority: Major
>
> [HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an
> interrupt in the DataStreamer class that aborts the waitForAckedSeqno call
> when the timeout has been exceeded, throwing an InterruptedIOException. This
> method is used by
> [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773],
> one of whose callers is
> [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870],
> which closes a file.
> What we saw was InterruptedIOExceptions thrown during the flushInternal call
> while we were closing out a file; these went unhandled by DFSClient and
> propagated to the caller. There is a known issue,
> [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504], where a file
> that fails to close on the HDFS side leaks its lease until the DFSClient is
> recycled. In our HBase setups, DFSClients remain long-lived in regionservers,
> which means these files stay undead until the corresponding regionservers
> are restarted.
> We observed this issue during datanode decommissioning, which got stuck on
> open files caused by the above leakage. Since it is desirable to close an
> HDFS file as smoothly as possible, retrying flushInternal during closeImpl
> operations would help reduce such leakages.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]