Stephan Ewen created FLINK-9753:
-----------------------------------
Summary: Support ORC/Parquet for StreamingFileSink
Key: FLINK-9753
URL: https://issues.apache.org/jira/browse/FLINK-9753
Project: Flink
Issue Type: Sub-task
Components: Streaming Connectors
Reporter: Stephan Ewen
Assignee: Kostas Kloudas
Fix For: 1.6.0
Formats like Parquet and ORC are great at compressing data and making it fast
to scan/filter/project the data.
However, these formats are only efficient if they can columnarize and compress
a significant amount of data at once. If they compress only a
few rows at a time, they produce many short column vectors and are thus much
less efficient.
The Bucketing Sink has the requirement that data is persisted to the target
FileSystem on each checkpoint.
Pushing data through a Parquet or ORC encoder and flushing on each checkpoint
means that for frequent checkpoints, the amount of data compressed/columnarized
in a block is small. Hence, the result is an inefficiently compressed file.
Making this efficient independently of the checkpoint interval would mean that
the sink needs to first collect (and persist) a good amount of data and then
push it through the Parquet/ORC writers.
I would suggest approaching this as follows:
- When writing to the "in progress files" write the raw records
(TypeSerializer encoding)
- When the "in progress file" is rolled over (published), the sink pushes the
data through the encoder.
- This is not much work on top of the new abstraction and will result in large
blocks and hence in efficient compression (see the sketch after this list).
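For illustration, here is a rough sketch of that two-phase idea. The
{{BulkEncoder}} interface is a hypothetical stand-in for a Parquet/ORC writer,
and the rollover loop is simplified; the actual interfaces of the new
StreamingFileSink abstraction may look different.
{code:java}
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputViewStreamWrapper;
import org.apache.flink.core.memory.DataOutputViewStreamWrapper;

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical columnar encoder, e.g. backed by a Parquet or ORC writer. */
interface BulkEncoder<T> {
    void addElement(T element) throws IOException;
    void finish() throws IOException;
}

/**
 * Phase 1: append raw records to the "in progress" file in the
 * TypeSerializer binary format. This is cheap and can be flushed/persisted
 * on every checkpoint without hurting columnar compression.
 */
class RawRecordWriter<T> {
    private final TypeSerializer<T> serializer;
    private final DataOutputViewStreamWrapper out;

    RawRecordWriter(TypeSerializer<T> serializer, OutputStream rawPartFile) {
        this.serializer = serializer;
        this.out = new DataOutputViewStreamWrapper(rawPartFile);
    }

    void write(T record) throws IOException {
        serializer.serialize(record, out);
    }
}

/**
 * Phase 2: when the in-progress file is rolled over (published), read the
 * raw records back and push the whole batch through the Parquet/ORC encoder,
 * producing one large, well-compressed block.
 */
class RolloverEncoder<T> {
    void encode(TypeSerializer<T> serializer,
                InputStream rawPartFile,
                BulkEncoder<T> bulkEncoder) throws IOException {
        DataInputViewStreamWrapper in = new DataInputViewStreamWrapper(rawPartFile);
        while (true) {
            T record;
            try {
                record = serializer.deserialize(in);
            } catch (EOFException endOfRawFile) {
                break; // all raw records consumed; real code would track counts/lengths
            }
            bulkEncoder.addElement(record);
        }
        bulkEncoder.finish(); // write footer / final column chunks
    }
}
{code}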
Alternatively, we can support directly encoding the stream to the "in progress
files" via Parquet/ORC, if users know that their combination of data rate and
checkpoint interval will result in large enough chunks of data per checkpoint
interval (see the second sketch below).
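A minimal sketch of that direct-encoding variant, again with a hypothetical
{{ParquetLikeEncoder}} standing in for the real Parquet/ORC writer. Flushing
the current row group / stripe on every checkpoint is exactly what makes this
variant sensitive to the data rate and checkpoint interval.
{code:java}
import java.io.IOException;

/** Hypothetical writer that streams records straight into columnar format. */
interface ParquetLikeEncoder<T> {
    void addElement(T element) throws IOException;
    /** Forces the current row group / stripe to disk, e.g. on checkpoint. */
    void flushBlock() throws IOException;
    void finish() throws IOException;
}

class DirectColumnarWriter<T> {
    private final ParquetLikeEncoder<T> encoder;

    DirectColumnarWriter(ParquetLikeEncoder<T> encoder) {
        this.encoder = encoder;
    }

    void write(T record) throws IOException {
        encoder.addElement(record);
    }

    /**
     * Called on checkpoint: everything buffered so far must be persisted,
     * which closes the current row group / stripe. With frequent checkpoints
     * and low data rates this yields small, poorly compressed blocks; hence
     * this path only makes sense for large enough chunks per checkpoint.
     */
    void persistOnCheckpoint() throws IOException {
        encoder.flushBlock();
    }
}
{code}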
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)