Github user kl0u commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2198#discussion_r69561903
  
    --- Diff: docs/apis/streaming/index.md ---
    @@ -1310,21 +1310,28 @@ Data Sources
     
     <br />
     
    -Sources can by created by using 
`StreamExecutionEnvironment.addSource(sourceFunction)`.
    -You can either use one of the source functions that come with Flink or 
write a custom source
    -by implementing the `SourceFunction` for non-parallel sources, or by 
implementing the
    -`ParallelSourceFunction` interface or extending 
`RichParallelSourceFunction` for parallel sources.
    +Sources are where your program reads its input from. You can attach a 
source to your program by 
    +using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes 
with a number of pre-implemented 
    +source functions, but you can always write your own custom sources by 
implementing the `SourceFunction` 
    +for non-parallel sources, or by implementing the `ParallelSourceFunction` 
interface or extending the 
    +`RichParallelSourceFunction` for parallel sources.
     
     There are several predefined stream sources accessible from the 
`StreamExecutionEnvironment`:
     
     File-based:
     
    -- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and 
returns them as Strings.
    +- `readTextFile(path)` - Reads text files, i.e. files that respect the 
`TextInputFormat` specification, line-by-line and returns them as Strings.
     
    -- `readFile(path)` / Any input format - Reads files as dictated by the 
input format.
    -
    -- `readFileStream` - create a stream by appending elements when there are 
changes to a file
    +- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by 
the specified file input format.
     
    +- `readFile(fileInputFormat, path, watchType, interval, pathFilter, 
typeInfo)` -  This is the method called internally by the two previous ones. It 
reads files in the `path` based on the given `fileInputFormat`. Depending on 
the provided `watchType`, this source may periodically monitor (every 
`interval` ms) the path for new data 
(`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently 
in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the 
`pathFilter`, the user can further exclude files from being processed.
    +               
    +    *IMPORTANT NOTES:* 
    +               
    +    1. If the `watchType` is set to 
`FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its 
contents are re-processed entirely. This can break the "exactly-once" 
semantics, as appending data at the end of a file will lead to **all** its 
contents being re-processed.
    +               
    +    2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the 
source scans the path **once** and exits, without waiting for the readers to 
finish reading. This leads to no more checkpoints after that point, thus 
providing reduced fault-tolerance guarantees.
    +                                                                           
  
    --- End diff --
    
    Will do that!
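
To illustrate note 1 in the diff above (a modified file is re-processed in full), here is a minimal plain-Java sketch. It is not Flink code: the class `WholeFileReprocessingSketch`, its `scan` method, and the "modification time plus size" change check are hypothetical stand-ins chosen for illustration, and Flink's actual monitoring logic may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a continuously monitoring file source; not Flink's implementation. */
class WholeFileReprocessingSketch {

    // Fingerprint (modification time + size) recorded for each file on the previous pass.
    // Assumption: this is only an illustrative way to detect that a file changed.
    private final Map<Path, String> lastSeen = new HashMap<>();

    /** One monitoring pass: emits ALL lines of every file that changed since the previous pass. */
    List<String> scan(List<Path> files) throws IOException {
        List<String> emitted = new ArrayList<>();
        for (Path file : files) {
            String fingerprint =
                    Files.getLastModifiedTime(file).toMillis() + ":" + Files.size(file);
            // put() returns the previous fingerprint, or null on the first pass.
            if (!fingerprint.equals(lastSeen.put(file, fingerprint))) {
                // The whole file is read again, not just the appended tail.
                emitted.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }
        return emitted;
    }

    /** Writes two lines, scans twice, appends one line, scans again; returns everything emitted. */
    static List<String> demo() throws IOException {
        Path file = Files.createTempFile("events", ".txt");
        try {
            Files.write(file, List.of("a", "b"), StandardCharsets.UTF_8);
            WholeFileReprocessingSketch source = new WholeFileReprocessingSketch();

            List<String> all = new ArrayList<>(source.scan(List.of(file))); // first pass reads the file
            all.addAll(source.scan(List.of(file)));                         // unchanged file: nothing emitted

            Files.write(file, List.of("c"), StandardCharsets.UTF_8, StandardOpenOption.APPEND);
            all.addAll(source.scan(List.of(file)));                         // append triggers a full re-read
            return all;
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

In `demo()`, the lines `a` and `b` are emitted a second time after `c` is appended, which is the duplication that note 1 warns can break exactly-once semantics.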

