[
https://issues.apache.org/jira/browse/FLINK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias updated FLINK-24147:
-----------------------------
Attachment: jobmanager.log
> HDFS lease issues on Flink retry
> --------------------------------
>
> Key: FLINK-24147
> URL: https://issues.apache.org/jira/browse/FLINK-24147
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Hadoop Compatibility
> Affects Versions: 1.14.0, 1.12.5, 1.13.2
> Reporter: Matthias
> Priority: Major
> Attachments: jobmanager.log
>
>
> This issue was brought up on the [ML thread "hdfs lease issues on flink
> retry"|https://lists.apache.org/x/thread.html/r9e5dc9cbd0a41b88565bd6c8c1c9d864ffdd343b4a96bd4dd0dd8a97@%3Cuser.flink.apache.org%3E].
> See attached jobmanager.log which was provided by the user.
> The user ran into a {{FileAlreadyExistsException}} when the job tried to create a
> file for which a lease already existed. [~dmvk] helped investigate this.
> The problem seems to be that we use a fixed retry id {{0}} in
> [HadoopOutputFormatBase:137|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-connectors/flink-hadoop-compatibility/src/main/java/org/apache/flink/api/java/hadoop/mapred/HadoopOutputFormatBase.java#L137].
> HDFS allows only one writer to access a file at a time. The LeaseManager
> enforces this through leases. It appears that another task attempt tried to
> access the same file because {{HadoopOutputFormatBase}} generated the same
> {{TaskAttemptId}} again. The retry interval (10 seconds in that case) was
> shorter than Hadoop's soft lease limit, so the lease of the earlier attempt
> was presumably still active. The soft limit is hard-coded to 1 minute (see
> [hadoop:HdfsConstants:62|https://github.com/apache/hadoop/blob/a9c1489e31e8f602de62bd3ecc517aa6597ab2f8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/HdfsConstants.java#L62]).
> We might be able to overcome this by using a dynamic retry count instead of
> the fixed {{_0}} suffix.
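
As a rough illustration of the suggestion above, the following Java sketch builds an attempt ID string whose last segment varies per retry. The class {{AttemptIdSketch}}, the helper {{attemptId}} and its {{attemptNumber}} parameter are hypothetical and not part of Flink or Hadoop. The string layout only mirrors the {{attempt__0000_r_<task>_0}} pattern that {{HadoopOutputFormatBase}} is assumed to build at the linked line. How the attempt number would be obtained inside Flink is left open here.

```java
public class AttemptIdSketch {

    /**
     * Hypothetical helper: formats an attempt ID string with a zero-padded,
     * one-based task number and a per-retry attempt number as the last segment.
     * The layout is an assumption modelled on the fixed "_0" suffix described above.
     */
    public static String attemptId(int taskNumber, int attemptNumber) {
        return String.format("attempt__0000_r_%06d_%d", taskNumber + 1, attemptNumber);
    }

    public static void main(String[] args) {
        // First run of task 0: same shape as the fixed-suffix ID.
        System.out.println(attemptId(0, 0)); // attempt__0000_r_000001_0
        // First retry of task 0: the last segment now differs.
        System.out.println(attemptId(0, 1)); // attempt__0000_r_000001_1
    }
}
```

With a distinct suffix per attempt, a retry would presumably write under a different attempt path instead of reopening a file whose lease is still held.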
--
This message was sent by Atlassian Jira
(v8.3.4#803005)