[
https://issues.apache.org/jira/browse/FLINK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362483#comment-15362483
]
ASF GitHub Bot commented on FLINK-4133:
---------------------------------------
Github user kl0u commented on a diff in the pull request:
https://github.com/apache/flink/pull/2198#discussion_r69561903
--- Diff: docs/apis/streaming/index.md ---
@@ -1310,21 +1310,28 @@ Data Sources
<br />
-Sources can by created by using
`StreamExecutionEnvironment.addSource(sourceFunction)`.
-You can either use one of the source functions that come with Flink or
write a custom source
-by implementing the `SourceFunction` for non-parallel sources, or by
implementing the
-`ParallelSourceFunction` interface or extending
`RichParallelSourceFunction` for parallel sources.
+Sources are where your program reads its input from. You can attach a
source to your program by
+using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes
with a number of pre-implemented
+source functions, but you can always write your own custom sources by
implementing the `SourceFunction`
+for non-parallel sources, or by implementing the `ParallelSourceFunction`
interface or extending the
+`RichParallelSourceFunction` for parallel sources.
There are several predefined stream sources accessible from the
`StreamExecutionEnvironment`:
File-based:
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and
returns them as Strings.
+- `readTextFile(path)` - Reads text files, i.e. files that respect the
`TextInputFormat` specification, line-by-line and returns them as Strings.
-- `readFile(path)` / Any input format - Reads files as dictated by the
input format.
-
-- `readFileStream` - create a stream by appending elements when there are
changes to a file
+- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by
the specified file input format.
+- `readFile(fileInputFormat, path, watchType, interval, pathFilter,
typeInfo)` - This is the method called internally by the two previous ones. It
reads files in the `path` based on the given `fileInputFormat`. Depending on
the provided `watchType`, this source may periodically monitor (every
`interval` ms) the path for new data
(`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently
in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the
`pathFilter`, the user can further exclude files from being processed.
+
+ *IMPORTANT NOTES:*
+
+ 1. If the `watchType` is set to
`FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its
contents are re-processed entirely. This can brake the "exactly-once"
semantics, as appending data at the end of a file will lead to **all** its
contents being re-processed.
+
+ 2. If the `watchType` is set to `FileProcessingMode#PROCESS_ONCE`, the
source scans the path **once** and exits, without waiting for the readers to
finish reading. This leads to no more checkpoints after that point, thus
providing reduced fault-tolerance guarantees.
+
--- End diff --
Will do that!
> Reflect streaming file source changes in documentation
> ------------------------------------------------------
>
> Key: FLINK-4133
> URL: https://issues.apache.org/jira/browse/FLINK-4133
> Project: Flink
> Issue Type: Bug
> Components: DataStream API, Documentation
> Reporter: Robert Metzger
> Assignee: Kostas Kloudas
>
> In FLINK-2314 the file sources for the DataStream API were reworked.
> The documentation doesn't explain the (new?) semantics of the file sources.
> In which order are files read?
> How are file modifications treated? (appends, in place modifications?)
> I suspect this table is also not up-to-date:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html#fault-tolerance-guarantees-of-data-sources-and-sinks
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)