[ https://issues.apache.org/jira/browse/FLINK-11116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-11116:
-----------------------------------
Labels: pull-request-available stale-major (was: pull-request-available)
> Overwrite outdated in-progress files in StreamingFileSink.
> ----------------------------------------------------------
>
> Key: FLINK-11116
> URL: https://issues.apache.org/jira/browse/FLINK-11116
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Affects Versions: 1.7.0
> Reporter: Kostas Kloudas
> Priority: Major
> Labels: pull-request-available, stale-major
> Fix For: 1.7.3
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> To guarantee exactly-once semantics, the streaming file sink implements a
> two-phase commit protocol when writing files to the filesystem. Data is first
> written to in-progress files. These files are put into the "pending" state
> when they are completed (based on the rolling policy), and they are finally
> committed when the checkpoint that put them in the "pending" state is
> acknowledged as complete.
> This means that if we have:
> 1) checkpoints A, B, C coming,
> 2) checkpoint A being acknowledged, and
> 3) a failure,
> then we may have files that do not belong to any checkpoint (because B and C
> were not considered successful). These files are currently not cleaned up.
> To reduce the number of such files, we removed the random suffix from
> in-progress temporary files, so that the next in-progress file that is opened
> for this part overwrites them.
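
The effect of dropping the random suffix can be illustrated with a small, self-contained Java sketch. It is not the StreamingFileSink code: the class name, the helper methods, and the ".part-<subtask>-<counter>.inprogress" naming scheme are hypothetical, and it assumes that after a restart the sink reopens the same part (same subtask index and part counter) that was in progress when the failure happened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.stream.Stream;

// Illustrative sketch only; names and file layout are assumptions, not Flink's API.
class InProgressOverwriteSketch {

    // Hypothetical naming scheme for in-progress files.
    // With randomSuffix == true, every open of the same part gets a new name.
    static String inProgressName(int subtask, long partCounter, boolean randomSuffix) {
        String base = ".part-" + subtask + "-" + partCounter + ".inprogress";
        return randomSuffix ? base + "." + UUID.randomUUID() : base;
    }

    // Opens the in-progress file for a part and writes some data to it.
    // Files.write without explicit options creates the file or truncates an existing one.
    static Path writeInProgress(Path bucket, int subtask, long partCounter,
                                boolean randomSuffix, String data) throws IOException {
        Path file = bucket.resolve(inProgressName(subtask, partCounter, randomSuffix));
        Files.write(file, data.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    // Counts the files currently present in the bucket directory.
    static long countFiles(Path bucket) throws IOException {
        try (Stream<Path> files = Files.list(bucket)) {
            return files.count();
        }
    }

    // Simulates: write to an in-progress file, fail before any checkpoint covers it,
    // restart, and reopen the same part. Returns how many files are left in the bucket.
    static long leftoverFilesAfterRestart(boolean randomSuffix) {
        try {
            Path bucket = Files.createTempDirectory("bucket");
            writeInProgress(bucket, 0, 1, randomSuffix, "written before the failure");
            // Simulated failure and restart: the same part is opened again (assumption).
            writeInProgress(bucket, 0, 1, randomSuffix, "written after the restart");
            return countFiles(bucket);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("random suffix: " + leftoverFilesAfterRestart(true) + " file(s)");
        System.out.println("deterministic: " + leftoverFilesAfterRestart(false) + " file(s)");
    }
}
```

With a random suffix, the file written before the failure stays behind as an orphan next to the new one. With a deterministic name, the second open truncates and replaces it, so only one file remains in the bucket.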