Stephan Ewen created FLINK-3515:
-----------------------------------
Summary: Make the "file monitoring source" exactly-once
Key: FLINK-3515
URL: https://issues.apache.org/jira/browse/FLINK-3515
Project: Flink
Issue Type: Improvement
Components: Streaming
Affects Versions: 0.10.2
Reporter: Stephan Ewen
The stream source that watches directories for changes is currently not
"exactly-once".
To make it exactly-once, the source (which emits the files to be read) and the
flatMap (which reads the files) need to keep track of where they were at the
point of a checkpoint.
Assuming that files do not change after creation (HDFS / S3 style), we can do
this as follows:
- The source can track the files it already emitted downstream via file
creation/modification timestamp, assuming that new files always get newer
timestamps.
- The flatMappers always need to store the path of their current file split,
plus the byte offset they had reached within that split.
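
The two pieces of state described above could look roughly like the following.
This is only a minimal sketch in plain Java, not Flink API: the names
FileMonitorSketch, MonitorState, SplitReader and ReaderCheckpoint are
hypothetical, the "file" is an in-memory byte array with newline-delimited
records, and the source side relies on the stated assumption that new files
always get strictly newer timestamps.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: all names here are illustrative, none are Flink classes.
final class FileMonitorSketch {

    /** Source side: remembers the newest modification timestamp already emitted. */
    static final class MonitorState {
        private long maxEmittedTimestamp = Long.MIN_VALUE;

        /** Returns paths strictly newer than anything emitted so far, oldest first. */
        List<String> selectNewFiles(Map<String, Long> pathToModTime) {
            List<Map.Entry<String, Long>> fresh = new ArrayList<>();
            for (Map.Entry<String, Long> e : pathToModTime.entrySet()) {
                if (e.getValue() > maxEmittedTimestamp) {
                    fresh.add(e);
                }
            }
            // Order by timestamp, then by path, so emission order is deterministic.
            fresh.sort((a, b) -> {
                int byTime = Long.compare(a.getValue(), b.getValue());
                return byTime != 0 ? byTime : a.getKey().compareTo(b.getKey());
            });
            List<String> paths = new ArrayList<>();
            for (Map.Entry<String, Long> e : fresh) {
                paths.add(e.getKey());
                maxEmittedTimestamp = Math.max(maxEmittedTimestamp, e.getValue());
            }
            return paths;
        }

        long snapshot() { return maxEmittedTimestamp; }          // stored in the checkpoint
        void restore(long timestamp) { maxEmittedTimestamp = timestamp; }
    }

    /** Reader-side checkpoint: the current split's path and the byte offset within it. */
    static final class ReaderCheckpoint {
        final String path;
        final int offset;
        ReaderCheckpoint(String path, int offset) { this.path = path; this.offset = offset; }
    }

    /** Reader side: reads newline-delimited records and tracks its byte offset. */
    static final class SplitReader {
        private String path;
        private byte[] data;
        private int offset; // a real file reader would use a long and seek to it

        void open(String path, byte[] data, int offset) {
            this.path = path;
            this.data = data;
            this.offset = offset;
        }

        /** Returns the next record, or null when the split is exhausted. */
        String nextRecord() {
            if (offset >= data.length) {
                return null;
            }
            int end = offset;
            while (end < data.length && data[end] != '\n') {
                end++;
            }
            String record = new String(data, offset, end - offset, StandardCharsets.UTF_8);
            offset = Math.min(end + 1, data.length); // skip the newline itself
            return record;
        }

        ReaderCheckpoint snapshot() { return new ReaderCheckpoint(path, offset); }
    }
}
```

On recovery, the source would restore the timestamp and skip every file that is
not newer than it, while each reader would reopen the checkpointed path at the
stored offset and continue from the next unread record. Note that files sharing
the exact timestamp of the last emitted file would be skipped in this sketch,
which is why the "newer timestamps" assumption matters.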