[
https://issues.apache.org/jira/browse/FLINK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143244#comment-16143244
]
ASF GitHub Bot commented on FLINK-6306:
---------------------------------------
GitHub user sjwiesman opened a pull request:
https://github.com/apache/flink/pull/4607
[FLINK-6306][connectors] Sink for eventually consistent file systems
## What is the purpose of the change
This pull request implements a sink that writes to an eventually
consistent filesystem, such as Amazon S3, with exactly-once semantics.
## Brief change log
- The sink stages files on a consistent filesystem (local, HDFS, etc.).
- Once per checkpoint, files are copied to the eventually consistent
filesystem.
- When a checkpoint completion notification arrives, the files are marked
consistent. Otherwise, they are left in place unmarked, because delete is
not a consistent operation.
- It is up to consumers to choose their semantics: at-least-once by
reading all files, or exactly-once by reading only files marked consistent.
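
The steps in the change log can be illustrated with a small, self-contained sketch. It is not the code from this pull request: the class name, method names, the `part-<checkpointId>` key layout, and the `.consistent` marker suffix are all hypothetical, and in-memory collections stand in for both the staging filesystem and the eventually consistent store. It also assumes that a completion notification for checkpoint N covers files copied for earlier checkpoints that are still pending. The only remote operation it relies on is PUT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch of the staging/copy/mark protocol; not the PR's classes. */
final class EventuallyConsistentSinkSketch {
    // Assumed marker convention: a data file is "consistent" once this key exists.
    static final String MARKER_SUFFIX = ".consistent";

    // Stand-in for the eventually consistent store; only put() is ever used on it.
    final Map<String, String> remote = new TreeMap<>();
    // Stand-in for the consistent staging filesystem (records since last checkpoint).
    private final List<String> staged = new ArrayList<>();
    // Remote keys copied per checkpoint that have not been marked consistent yet.
    private final NavigableMap<Long, List<String>> pending = new TreeMap<>();

    /** Stage a record on the consistent filesystem. */
    void write(String record) {
        staged.add(record);
    }

    /** Once per checkpoint: copy staged data to the remote store with a PUT. */
    void snapshot(long checkpointId) {
        if (staged.isEmpty()) {
            return;
        }
        String key = "part-" + checkpointId; // hypothetical naming scheme
        remote.put(key, String.join("\n", staged));
        staged.clear();
        pending.computeIfAbsent(checkpointId, id -> new ArrayList<>()).add(key);
    }

    /** On completion: mark files of this and earlier pending checkpoints with a PUT. */
    void notifyCheckpointComplete(long checkpointId) {
        NavigableMap<Long, List<String>> done = pending.headMap(checkpointId, true);
        for (List<String> keys : done.values()) {
            for (String key : keys) {
                remote.put(key + MARKER_SUFFIX, "");
            }
        }
        done.clear(); // the view is backed by 'pending', so this removes the entries
    }

    /** At-least-once consumer: read every data file, marked or not. */
    List<String> readAtLeastOnce() {
        List<String> files = new ArrayList<>();
        for (String key : remote.keySet()) {
            if (!key.endsWith(MARKER_SUFFIX)) {
                files.add(key);
            }
        }
        return files;
    }

    /** Exactly-once consumer: read only data files that have a marker. */
    List<String> readExactlyOnce() {
        List<String> files = new ArrayList<>();
        for (String key : remote.keySet()) {
            if (!key.endsWith(MARKER_SUFFIX) && remote.containsKey(key + MARKER_SUFFIX)) {
                files.add(key);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        EventuallyConsistentSinkSketch sink = new EventuallyConsistentSinkSketch();
        sink.write("a");
        sink.snapshot(1L);
        sink.notifyCheckpointComplete(1L); // checkpoint 1 succeeds
        sink.write("b");
        sink.snapshot(2L);                 // checkpoint 2 never completes
        System.out.println("at-least-once: " + sink.readAtLeastOnce());
        System.out.println("exactly-once:  " + sink.readExactlyOnce());
    }
}
```

In this sketch, a file copied for a checkpoint that never completes is never deleted; it simply stays unmarked, so an exactly-once reader skips it while an at-least-once reader still sees it.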
## Verifying this change
This change added tests and can be verified as follows:
- Added tests based on the existing BucketingSink test suite.
- Added tests that verify semantics based on different checkpointing
combinations (successful, concurrent, timed out, and failed).
- Added integration test that verifies exactly once holds during failure.
- Manually verified by running in production for several months.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
## Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? JavaDocs
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sjwiesman/flink FLINK-6306
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4607.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4607
----
commit 347ea767195d74efc39964c02ace1bbe10d8aa0a
Author: Seth Wiesman <[email protected]>
Date: 2017-08-27T21:36:04Z
[FLINK-6306][connectors] Sink for eventually consistent file systems
----
> Sink for eventually consistent file systems
> -------------------------------------------
>
> Key: FLINK-6306
> URL: https://issues.apache.org/jira/browse/FLINK-6306
> Project: Flink
> Issue Type: New Feature
> Components: filesystem-connector
> Reporter: Seth Wiesman
> Assignee: Seth Wiesman
> Attachments: eventually-consistent-sink
>
>
> Currently Flink provides the BucketingSink as an exactly once method for
> writing out to a file system. It provides these guarantees by moving files
> through several stages and deleting or truncating files that get into a bad
> state. While this is a powerful abstraction, it causes issues with eventually
> consistent file systems such as Amazon's S3 where most operations (i.e., rename,
> delete, truncate) are not guaranteed to become consistent within a reasonable
> amount of time. Flink should provide a sink that provides exactly once writes
> to a file system where only PUT operations are considered consistent.