[
https://issues.apache.org/jira/browse/FLINK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362482#comment-15362482
]
ASF GitHub Bot commented on FLINK-4133:
---------------------------------------
Github user kl0u commented on a diff in the pull request:
https://github.com/apache/flink/pull/2198#discussion_r69561860
--- Diff: docs/apis/streaming/index.md ---
@@ -1310,21 +1310,28 @@ Data Sources
<br />
-Sources can by created by using
`StreamExecutionEnvironment.addSource(sourceFunction)`.
-You can either use one of the source functions that come with Flink or
write a custom source
-by implementing the `SourceFunction` for non-parallel sources, or by
implementing the
-`ParallelSourceFunction` interface or extending
`RichParallelSourceFunction` for parallel sources.
+Sources are where your program reads its input from. You can attach a
source to your program by
+using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes
with a number of pre-implemented
+source functions, but you can always write your own custom sources by
implementing the `SourceFunction`
+for non-parallel sources, or by implementing the `ParallelSourceFunction`
interface or extending the
+`RichParallelSourceFunction` for parallel sources.
There are several predefined stream sources accessible from the
`StreamExecutionEnvironment`:
File-based:
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and
returns them as Strings.
+- `readTextFile(path)` - Reads text files, i.e. files that respect the
`TextInputFormat` specification, line-by-line and returns them as Strings.
-- `readFile(path)` / Any input format - Reads files as dictated by the
input format.
-
-- `readFileStream` - create a stream by appending elements when there are
changes to a file
+- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by
the specified file input format.
+- `readFile(fileInputFormat, path, watchType, interval, pathFilter,
typeInfo)` - This is the method called internally by the two previous ones. It
reads files in the `path` based on the given `fileInputFormat`. Depending on
the provided `watchType`, this source may periodically monitor (every
`interval` ms) the path for new data
(`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently
in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the
`pathFilter`, the user can further exclude files from being processed.
+
+ *IMPORTANT NOTES:*
+
+ 1. If the `watchType` is set to
`FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its
contents are re-processed entirely. This can brake the "exactly-once"
semantics, as appending data at the end of a file will lead to **all** its
contents being re-processed.
+
+ 2. If the `watchType` is set to `FileProcessingMode#PROCESS_ONCE`, the
source scans the path **once** and exits, without waiting for the readers to
finish reading. This leads to no more checkpoints after that point, thus
providing reduced fault-tolerance guarantees.
--- End diff --
1) I will integrate that
2) You are right. Files are read completely. The only problem is that upon
recovery, the job will restart from the last checkpoint, i.e. the last before
the source closes.
3) when we process once, there is no explicit cancelation required. We
decided to close the source explicitly, rather than wait for an external
signal. This is also reasonable for resource utilization.
> Reflect streaming file source changes in documentation
> ------------------------------------------------------
>
> Key: FLINK-4133
> URL: https://issues.apache.org/jira/browse/FLINK-4133
> Project: Flink
> Issue Type: Bug
> Components: DataStream API, Documentation
> Reporter: Robert Metzger
> Assignee: Kostas Kloudas
>
> In FLINK-2314 the file sources for the DataStream API were reworked.
> The documentation doesn't explain the (new?) semantics of the file sources.
> In which order are files read?
> How are file modifications treated? (appends, in place modifications?)
> I suspect this table is also not up-to-date:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html#fault-tolerance-guarantees-of-data-sources-and-sinks
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)