[ https://issues.apache.org/jira/browse/FLINK-20188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444601#comment-17444601 ]

Stephan Ewen commented on FLINK-20188:
--------------------------------------

Hi!

I started writing/outlining some documentation for the File Source a few weeks 
back. If you are interested, you could take over this effort. 
My outline for the File Source documentation is below. I think it would be good 
to first agree on the outline you want to use (either the suggestion below or 
one of your own) and only then start writing the main contents.

===============================
Outline

h1. File Source Docs

_(introductory paragraph)_

Read a set of files from a FileSystem (POSIX, S3, HDFS, ...) with a *Format* 
(e.g., CSV, Parquet, Avro, ORC, ...), producing a stream of records (DataStream 
API) or rows (SQL, Table API).

h2. Basic Structure: File Enumerator and Format

* Enumerator: Discovers and identifies the files to read.
* Format: Parser for reading the files. Contains, for example, the CSV parser, 
the Avro decoder, or the Parquet columnar reader.

h2. Bounded and Unbounded Streams

The bounded File Source lists all files (via the enumerator, typically a 
recursive directory listing that filters out hidden files) and reads them all. 

The unbounded source is created by configuring the enumerator for periodic file 
discovery. In that case it enumerates once, like the bounded case, but then 
additionally re-enumerates at a configurable interval. Repeated enumerations 
filter out previously seen files and send only the new ones to the readers.
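
A minimal sketch of the difference in the DataStream API (assuming the FileSource builder from org.apache.flink.connector.file.src and a line-based StreamFormat; the exact format class name may differ across versions, and the paths are placeholders):

{code:java}
import java.time.Duration;

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

// Bounded: enumerate the given paths once and finish when all files have been read.
FileSource<String> bounded =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("file:///data/in"))
                .build();

// Unbounded: re-enumerate every 10 seconds; previously seen files are filtered out
// and only newly discovered files are sent to the readers.
FileSource<String> unbounded =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("file:///data/in"))
                .monitorContinuously(Duration.ofSeconds(10))
                .build();
{code}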

h2. Creating a File Source

(code example section with three tabs: DataStream, SQL, Table API)

  - DDL for SQL
  - Table API source builder
  - DataStream API code to create a file source
    - The DataStream API needs to call the right factory method, depending on 
the format type (see the sketch below).
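
A rough illustration of what the DataStream and SQL tabs could contain (a sketch only, with placeholder paths and table/field names; the Table API tab would use the same connector options programmatically):

{code:java}
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// DataStream API: the factory method has to match the format type
// (forRecordStreamFormat for a StreamFormat, forBulkFileFormat for a BulkFormat).
FileSource<String> source =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("file:///data/in"))
                .build();
DataStream<String> lines =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

// SQL: the same files exposed as a table via the filesystem connector.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.executeSql(
        "CREATE TABLE file_rows (line STRING) WITH ("
                + " 'connector' = 'filesystem',"
                + " 'path' = 'file:///data/in',"
                + " 'format' = 'raw')");
{code}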


h2. Types of Formats & Implementing a new Format

* Stream Format: Simple format for reading records from a stream. Takes care of 
several aspects, like applying common decompressors and batching the record 
handover between I/O threads and the runtime. A good match for formats that 
encode records individually (like CSV, line-wise text, ...).
** SimpleStreamFormat is an even simpler variant for non-splittable files (see 
the sketch below).
* Bulk Format: Gives the user explicit control over reading the file and 
handing bundles of records over to the runtime. Suited particularly for formats 
that encode/compress data in batches (Parquet, ORC). It also offers more 
control over record deserialization and the handover between I/O threads and 
the runtime, so it can be a good choice for deep performance optimization.
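
A minimal sketch of what implementing a SimpleStreamFormat could look like (the class name TextLinesFormat is made up for illustration; a line-based format already ships with the connector):

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.file.src.reader.SimpleStreamFormat;
import org.apache.flink.connector.file.src.reader.StreamFormat;
import org.apache.flink.core.fs.FSDataInputStream;

// Hypothetical format: emits every text line of a (non-splittable) file as one record.
public class TextLinesFormat extends SimpleStreamFormat<String> {

    @Override
    public StreamFormat.Reader<String> createReader(Configuration config, FSDataInputStream stream)
            throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        return new StreamFormat.Reader<String>() {
            @Override
            public String read() throws IOException {
                // Returning null signals the end of the file.
                return reader.readLine();
            }

            @Override
            public void close() throws IOException {
                reader.close();
            }
        };
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}
{code}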

h2. Customizing File Enumeration

Implement custom enumerators and split assigners.
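
A sketch of plugging a different enumerator into the builder (assuming the setFileEnumerator(FileEnumerator.Provider) builder method and the NonSplittingRecursiveEnumerator that ships with the connector; split assigners would be customized analogously via setSplitAssigner):

{code:java}
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.enumerate.NonSplittingRecursiveEnumerator;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

// Use an enumerator that never splits files into byte ranges; a fully custom
// enumerator would implement the FileEnumerator interface instead.
FileSource<String> source =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("file:///data/in"))
                .setFileEnumerator(NonSplittingRecursiveEnumerator::new)
                .build();
{code}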

h2. Current Limitations

Watermarking does not work particularly well for large backlogs of files, 
because watermarks advance eagerly within a file, and a subsequently read file 
may then contain records whose timestamps lie behind the already emitted 
watermark (i.e., late data). We are looking at ways to generate the watermarks 
based more on global information.

For unbounded File Sources, the enumerator currently remembers the paths of all 
already-processed files, which is state that can in some cases grow rather 
large. We plan to add a more compact way of tracking already-processed files in 
the future (for example, by keeping lower bounds of the modification timestamps).


> Add Documentation for new File Source
> -------------------------------------
>
>                 Key: FLINK-20188
>                 URL: https://issues.apache.org/jira/browse/FLINK-20188
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem, Documentation
>    Affects Versions: 1.14.0, 1.13.3, 1.15.0
>            Reporter: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.15.0, 1.14.1, 1.13.4
>
>         Attachments: image-2021-11-16-11-42-32-957.png
>
>



