[
https://issues.apache.org/jira/browse/FLINK-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864489#comment-15864489
]
ASF GitHub Bot commented on FLINK-5788:
---------------------------------------
GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/3301
[FLINK-5788] [docs] Improve documentation of FileSystem and spell out the
data persistence contract
This writes down the contract that the Flink `FileSystem` and
`FSDataOutputStream` implementations have to adhere to in order to support
proper consistency and failure recovery. The contract has so far been only
implicitly defined and adhered to by the checkpointing and high-availability
code.
## Contract
Data written to an `FSDataOutputStream` created from a `FileSystem` is
considered persistent if two requirements are met:
1. **Visibility Requirement:** It must be guaranteed that all other processes,
machines, virtual machines, containers, etc. that are able to access the file
see the data consistently when given the absolute file path. This requirement
is similar to the *close-to-open* semantics defined by POSIX, but restricted to
the file itself (by its absolute path).
2. **Durability Requirement:** The file system's specific durability/persistence
requirements must be met. These are specific to the particular file system. For
example, the `LocalFileSystem` does not provide any durability guarantees for
crashes of both hardware and operating system, while replicated distributed
file systems (like HDFS) typically guarantee durability in the presence of up
to *n* concurrent node failures, where *n* is the replication factor.
Updates to the file's parent directory (for example, the file showing up when
the directory contents are listed) are not required to be complete for the data
in the file stream to be considered persistent. This relaxation is important for
file systems where updates to directory contents are only eventually consistent
(like S3).
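As a rough illustration of the contract, the sketch below uses only the JDK's
`java.nio` classes on a local file, not Flink's `FileSystem` or
`FSDataOutputStream`. The class and method names (`PersistenceSketch`,
`writePersistently`, `readByAbsolutePath`) are hypothetical, and the call to
`FileChannel.force` is an assumption of this sketch rather than something the
contract prescribes. The writer treats the data as persistent only after the
stream has been closed successfully, and the reader opens the file by its
absolute path instead of discovering it through a directory listing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for a writer/reader pair; not part of Flink.
final class PersistenceSketch {

    // Writes the bytes and returns the absolute path only after the sync and
    // the close have both completed without an exception.
    static Path writePersistently(Path file, byte[] data) throws IOException {
        Path absolute = file.toAbsolutePath();
        try (FileChannel channel = FileChannel.open(
                absolute,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Assumption: ask the OS to push content and metadata to the
            // storage device. What this buys depends on the file system.
            channel.force(true);
        }
        // The channel is closed here; only now is the path handed out.
        return absolute;
    }

    // Reads the data back using nothing but the absolute path, so the result
    // does not depend on the parent directory listing being up to date.
    static byte[] readByAbsolutePath(Path absolute) throws IOException {
        return Files.readAllBytes(absolute);
    }
}
```

A caller would store the returned path (for example in checkpoint metadata) and
later pass it to `readByAbsolutePath`. On an eventually consistent store the
same pattern avoids relying on the file appearing in a listing of its parent
directory.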
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink filesystem_docs
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/3301.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3301
----
> Document assumptions about File Systems and persistence
> -------------------------------------------------------
>
> Key: FLINK-5788
> URL: https://issues.apache.org/jira/browse/FLINK-5788
> Project: Flink
> Issue Type: Improvement
> Components: Documentation
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Fix For: 1.3.0
>
>
> We should add some description about the assumptions we make for the behavior
> of {{FileSystem}} implementations to support proper checkpointing and
> recovery operations.
> This is especially critical for file systems like {{S3}} with a somewhat
> tricky contract.