[ https://issues.apache.org/jira/browse/FLINK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640990#comment-16640990 ]
Huyen Levan commented on FLINK-9753:
------------------------------------
[~kkl0u] In the description section you mentioned two approaches: (1) only
encode when publishing, and (2) encode right from the time of writing to the
in-progress files. May I ask which approach you chose in the end? If both are
provided, how can the user choose which one to use?
Thanks!
> Support Parquet for StreamingFileSink
> -------------------------------------
>
> Key: FLINK-9753
> URL: https://issues.apache.org/jira/browse/FLINK-9753
> Project: Flink
> Issue Type: Sub-task
> Components: Streaming Connectors
> Reporter: Stephan Ewen
> Assignee: Kostas Kloudas
> Priority: Major
> Fix For: 1.6.0
>
>
> Formats like Parquet and ORC are great at compressing data and making it fast
> to scan/filter/project the data.
> However, these formats are only efficient if they can columnarize and
> compress a significant amount of data in their columnar format. If they
> compress only a few rows at a time, they produce many short column vectors
> and are thus much less efficient.
> The Bucketing Sink has the requirement that data be persisted to the target
> FileSystem on each checkpoint.
> Pushing data through a Parquet or ORC encoder and flushing on each checkpoint
> means that for frequent checkpoints, the amount of data
> compressed/columnarized in a block is small. Hence, the result is an
> inefficiently compressed file.
> Making this efficient independently of the checkpoint interval would mean
> that the sink needs to first collect (and persist) a good amount of data and
> then push it through the Parquet/ORC writers.
> I would suggest approaching this as follows:
> - When writing to the "in progress files" write the raw records
> (TypeSerializer encoding)
> - When the "in progress file" is rolled over (published), the sink pushes
> the data through the encoder.
> - This is not much work on top of the new abstraction and will result in
> large blocks, and hence in efficient compression (see the sketch after this
> list).
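> A minimal sketch of this two-phase idea, assuming a hypothetical
> TwoPhaseWriter (the class and its methods are illustrative, not an actual
> Flink API): records are staged row-by-row with the TypeSerializer while the
> part file is in progress, and only pushed through a BulkWriter (e.g. a
> Parquet writer) at publish time:
> {code:java}
> import org.apache.flink.api.common.serialization.BulkWriter;
> import org.apache.flink.api.common.typeutils.TypeSerializer;
> import org.apache.flink.core.memory.DataInputViewStreamWrapper;
> import org.apache.flink.core.memory.DataOutputViewStreamWrapper;
>
> import java.io.BufferedInputStream;
> import java.io.BufferedOutputStream;
> import java.io.EOFException;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.IOException;
>
> /**
>  * Hypothetical sketch of approach (1): stage raw records with the
>  * TypeSerializer while the part file is "in progress", and only push
>  * them through the columnar encoder when the file is published.
>  */
> class TwoPhaseWriter<T> {
>     private final TypeSerializer<T> serializer;
>     private final File stagingFile; // raw, row-encoded records
>     private final DataOutputViewStreamWrapper out;
>
>     TwoPhaseWriter(TypeSerializer<T> serializer, File stagingFile) throws IOException {
>         this.serializer = serializer;
>         this.stagingFile = stagingFile;
>         this.out = new DataOutputViewStreamWrapper(
>                 new BufferedOutputStream(new FileOutputStream(stagingFile)));
>     }
>
>     /** Phase 1: append a record in its raw TypeSerializer encoding. */
>     void addElement(T record) throws IOException {
>         serializer.serialize(record, out);
>     }
>
>     /** Phase 2: on rollover, re-read the staged records and bulk-encode them. */
>     void publish(BulkWriter<T> columnarWriter) throws IOException {
>         out.flush();
>         out.close();
>         try (DataInputViewStreamWrapper in = new DataInputViewStreamWrapper(
>                 new BufferedInputStream(new FileInputStream(stagingFile)))) {
>             while (true) {
>                 columnarWriter.addElement(serializer.deserialize(in));
>             }
>         } catch (EOFException endOfStagedRecords) {
>             // all staged records have been re-encoded
>         }
>         columnarWriter.finish();
>     }
> }
> {code}
> The staging file is row-encoded and cheap to flush, so persisting it on
> every checkpoint does not affect the size of the final columnar blocks.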
> Alternatively, we can support directly encoding the stream to the "in
> progress files" via Parquet/ORC, if users know that their combination of
> data rate and checkpoint interval will result in large enough chunks of data
> per checkpoint (see the example below).
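> A minimal sketch of what this direct encoding can look like with the
> StreamingFileSink bulk-format API, assuming the ParquetAvroWriters helper
> from flink-parquet; the wrapping class and method are illustrative:
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.flink.core.fs.Path;
> import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
>
> class DirectParquetSinkExample {
>     /**
>      * Encodes records straight to Parquet as they arrive. Bulk-encoded
>      * part files roll on checkpoint, so this only compresses well if
>      * enough data arrives per checkpoint interval.
>      */
>     static void attachSink(DataStream<GenericRecord> stream, Schema schema, String outputPath) {
>         StreamingFileSink<GenericRecord> sink = StreamingFileSink
>                 .forBulkFormat(new Path(outputPath),
>                         ParquetAvroWriters.forGenericRecord(schema))
>                 .build();
>         stream.addSink(sink);
>     }
> }
> {code}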