[
https://issues.apache.org/jira/browse/FLINK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971397#comment-15971397
]
Seth Wiesman commented on FLINK-6306:
-------------------------------------
The idea is that data is buffered on local disk until checkpoint when a single
put operation is performed to copy the data. You can think of it as bucketing
by checkpoint. The key is that because bad data from failed checkpoints cannot
be deleted, buckets containing files from successful checkpoints will get an
empty flag file signifying them consistent. It is then up to consuming
processes to only read from buckets with flag files with they require exactly
once semantics, otherwise they can read ever file and get at least once.
Attached is a poorly drawn diagram explaining how this works.
I am actually already running this in production right now to write out to S3
but it first requires FLINK-6315 so that I can differentiate between concurrent
checkpoints and timed out checkpoints. Otherwise, I would only be able to
guarantee exactly once is maxConcurrentCheckpoints was set to 1.
> Sink for eventually consistent file systems
> -------------------------------------------
>
> Key: FLINK-6306
> URL: https://issues.apache.org/jira/browse/FLINK-6306
> Project: Flink
> Issue Type: New Feature
> Components: filesystem-connector
> Reporter: Seth Wiesman
> Assignee: Seth Wiesman
> Attachments: eventually-consistent-sink
>
>
> Currently Flink provides the BucketingSink as an exactly once method for
> writing out to a file system. It provides these guarantees by moving files
> through several stages and deleting or truncating files that get into a bad
> state. While this is a powerful abstraction, it causes issues with eventually
> consistent file systems such as Amazon's S3 where most operations (ie rename,
> delete, truncate) are not guaranteed to become consistent within a reasonable
> amount of time. Flink should provide a sink that provides exactly once writes
> to a file system where only PUT operations are considered consistent.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)