[ 
https://issues.apache.org/jira/browse/FLINK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias updated FLINK-24147:
-----------------------------
    Attachment: jobmanager.log

> HDFS lease issues on Flink retry
> --------------------------------
>
>                 Key: FLINK-24147
>                 URL: https://issues.apache.org/jira/browse/FLINK-24147
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Hadoop Compatibility
>    Affects Versions: 1.14.0, 1.12.5, 1.13.2
>            Reporter: Matthias
>            Priority: Major
>         Attachments: jobmanager.log
>
>
> This issue was brought up on the [ML thread "hdfs lease issues on flink 
> retry"|https://lists.apache.org/x/thread.html/r9e5dc9cbd0a41b88565bd6c8c1c9d864ffdd343b4a96bd4dd0dd8a97@%3Cuser.flink.apache.org%3E].
>  See attached jobmanager.log which was provided by the user.
> The user ran into a {{FileAlreadyExistsException}} when trying to create a 
> file for which a lease already existed. [~dmvk] helped investigate this.
> The problem seems to be that we use a fixed retry id {{0}} in 
> [HadoopOutputFormatBase:137|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-connectors/flink-hadoop-compatibility/src/main/java/org/apache/flink/api/java/hadoop/mapred/HadoopOutputFormatBase.java#L137].
> Each resource in HDFS is allowed to have only one Writer accessing it. The 
> LeaseManager manages this through leases. It appears that we tried to access 
> the same file through another task due to {{HadoopOutputFormatBase}} 
> generating the same {{TaskAttemptId}}. The retry interval (10 seconds in 
> this case) was shorter than Hadoop's hard-coded soft lease limit of 1 minute (see 
> [hadoop:HdfsConstants:62|https://github.com/apache/hadoop/blob/a9c1489e31e8f602de62bd3ecc517aa6597ab2f8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/HdfsConstants.java#L62]).
> We might be able to overcome this by using a dynamic retry count instead of 
> the fixed {{_0}} suffix.
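
The proposed change can be illustrated with a minimal, dependency-free sketch. It is not the code in {{HadoopOutputFormatBase}}: the helper name, the class name, and the exact ID layout are assumptions made for illustration, loosely modelled on Hadoop's attempt ID strings. The only parts taken from the issue are that the current suffix is the fixed {{_0}} and that a retry therefore produces the same ID.

```java
class AttemptIdSketch {

    // Hypothetical helper. Assumed layout (not taken from the issue):
    // attempt_<jobtracker id>_<job number>_<task type>_<task number>_<attempt number>
    static String attemptId(int taskNumber, int attemptNumber) {
        // taskNumber is zero-based; the ID carries a one-based, zero-padded task number.
        // The last field is the dynamic attempt number instead of a constant 0.
        return String.format("attempt__0000_r_%06d_%d", taskNumber + 1, attemptNumber);
    }

    public static void main(String[] args) {
        String firstRun = attemptId(3, 0); // initial execution of task 3
        String retry = attemptId(3, 1);    // first retry of the same task

        System.out.println(firstRun);
        System.out.println(retry);
        // With a fixed suffix both calls would return the same string,
        // which is the collision described in the issue.
        System.out.println(firstRun.equals(retry));
    }
}
```

With distinct IDs per attempt, a retried task would no longer be derived from the same attempt ID as the failed one. Where the attempt number would come from in Flink, and whether this alone avoids the lease conflict, is left open here.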



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
