[ 
https://issues.apache.org/jira/browse/FLINK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971397#comment-15971397
 ] 

Seth Wiesman commented on FLINK-6306:
-------------------------------------

The idea is that data is buffered on local disk until checkpoint when a single 
put operation is performed to copy the data. You can think of it as bucketing 
by checkpoint. The key is that because bad data from failed checkpoints cannot 
be deleted, buckets containing files from successful checkpoints will get an 
empty flag file signifying them consistent. It is then up to consuming 
processes to only read from buckets with flag files with they require exactly 
once semantics, otherwise they can read ever file and get at least once. 
Attached is a poorly drawn diagram explaining how this works. 

I am actually already running this in production right now to write out to S3 
but it first requires FLINK-6315 so that I can differentiate between concurrent 
checkpoints and timed out checkpoints. Otherwise, I would only be able to 
guarantee exactly once is maxConcurrentCheckpoints was set to 1. 

> Sink for eventually consistent file systems
> -------------------------------------------
>
>                 Key: FLINK-6306
>                 URL: https://issues.apache.org/jira/browse/FLINK-6306
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Seth Wiesman
>            Assignee: Seth Wiesman
>         Attachments: eventually-consistent-sink
>
>
> Currently Flink provides the BucketingSink as an exactly once method for 
> writing out to a file system. It provides these guarantees by moving files 
> through several stages and deleting or truncating files that get into a bad 
> state. While this is a powerful abstraction, it causes issues with eventually 
> consistent file systems such as Amazon's S3 where most operations (ie rename, 
> delete, truncate) are not guaranteed to become consistent within a reasonable 
> amount of time. Flink should provide a sink that provides exactly once writes 
> to a file system where only PUT operations are considered consistent. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to