[ 
https://issues.apache.org/jira/browse/FLINK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362470#comment-15362470
 ] 

ASF GitHub Bot commented on FLINK-4133:
---------------------------------------

Github user rmetzger commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2198#discussion_r69560355
  
    --- Diff: docs/apis/streaming/index.md ---
    @@ -1310,21 +1310,28 @@ Data Sources
     
     <br />
     
    -Sources can by created by using 
`StreamExecutionEnvironment.addSource(sourceFunction)`.
    -You can either use one of the source functions that come with Flink or 
write a custom source
    -by implementing the `SourceFunction` for non-parallel sources, or by 
implementing the
    -`ParallelSourceFunction` interface or extending 
`RichParallelSourceFunction` for parallel sources.
    +Sources are where your program reads its input from. You can attach a 
source to your program by 
    +using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes 
with a number of pre-implemented 
    +source functions, but you can always write your own custom sources by 
implementing the `SourceFunction` 
    +for non-parallel sources, or by implementing the `ParallelSourceFunction` 
interface or extending the 
    +`RichParallelSourceFunction` for parallel sources.
     
     There are several predefined stream sources accessible from the 
`StreamExecutionEnvironment`:
     
     File-based:
     
    -- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and 
returns them as Strings.
    +- `readTextFile(path)` - Reads text files, i.e. files that respect the 
`TextInputFormat` specification, line-by-line and returns them as Strings.
     
    -- `readFile(path)` / Any input format - Reads files as dictated by the 
input format.
    -
    -- `readFileStream` - create a stream by appending elements when there are 
changes to a file
    +- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by 
the specified file input format.
     
    +- `readFile(fileInputFormat, path, watchType, interval, pathFilter, 
typeInfo)` -  This is the method called internally by the two previous ones. It 
reads files in the `path` based on the given `fileInputFormat`. Depending on 
the provided `watchType`, this source may periodically monitor (every 
`interval` ms) the path for new data 
(`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently 
in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the 
`pathFilter`, the user can further exclude files from being processed.
    +               
    +    *IMPORTANT NOTES:* 
    +               
    +    1. If the `watchType` is set to 
`FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its 
contents are re-processed entirely. This can brake the "exactly-once" 
semantics, as appending data at the end of a file will lead to **all** its 
contents being re-processed.
    +               
    +    2. If the `watchType` is set to `FileProcessingMode#PROCESS_ONCE`, the 
source scans the path **once** and exits, without waiting for the readers to 
finish reading. This leads to no more checkpoints after that point, thus 
providing reduced fault-tolerance guarantees.
    +                                                                           
  
    --- End diff --
    
    I think we should also explain how the file monitoring works (with one task 
monitoring the directory and a bunch of tasks reading the files)


> Reflect streaming file source changes in documentation
> ------------------------------------------------------
>
>                 Key: FLINK-4133
>                 URL: https://issues.apache.org/jira/browse/FLINK-4133
>             Project: Flink
>          Issue Type: Bug
>          Components: DataStream API, Documentation
>            Reporter: Robert Metzger
>            Assignee: Kostas Kloudas
>
> In FLINK-2314 the file sources for the DataStream API were reworked.
> The documentation doesn't explain the (new?) semantics of the file sources.
> In which order are files read?
> How are file modifications treated? (appends, in place modifications?)
> I suspect this table is also not up-to-date: 
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html#fault-tolerance-guarantees-of-data-sources-and-sinks
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to