[ https://issues.apache.org/jira/browse/FLINK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640990#comment-16640990 ]
Huyen Levan commented on FLINK-9753:
------------------------------------
[~kkl0u] In the description section you mentioned two approaches: (1) only
encode when publishing, and (2) encode right from the time of writing to the
in-progress files. May I ask which approach you chose in the end? If both are
provided, how can the user choose which one to use?
Thanks!
> Support Parquet for StreamingFileSink
> -------------------------------------
>
> Key: FLINK-9753
> URL: https://issues.apache.org/jira/browse/FLINK-9753
> Project: Flink
> Issue Type: Sub-task
> Components: Streaming Connectors
> Reporter: Stephan Ewen
> Assignee: Kostas Kloudas
> Priority: Major
> Fix For: 1.6.0
>
>
> Formats like Parquet and ORC are great at compressing data and making it fast
> to scan/filter/project the data.
> However, these formats are only efficient if they can columnarize and
> compress a significant amount of data in their columnar format. If they
> compress only a few rows at a time, they produce many short column vectors
> and are thus much less efficient.
> The Bucketing Sink has the requirement that data be persisted to the target
> FileSystem on each checkpoint.
> Pushing data through a Parquet or ORC encoder and flushing on each checkpoint
> means that for frequent checkpoints, the amount of data
> compressed/columnarized in a block is small. Hence, the result is an
> inefficiently compressed file.
> Making this efficient independently of the checkpoint interval would mean
> that the sink needs to first collect (and persist) a good amount of data and
> then push it through the Parquet/ORC writers.
> I would suggest approaching this as follows:
> - When writing to the "in progress files" write the raw records
> (TypeSerializer encoding)
> - When the "in progress file" is rolled over (published), the sink pushes
> the data through the encoder.
> - This is not much work on top of the new abstraction and will result in
> large blocks, and hence in efficient compression (see the sketch after this
> list).
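> A minimal sketch of this two-phase idea, assuming a hypothetical
> TwoPhaseWriter (the class and its methods are illustrative, not an actual
> Flink API): records are staged row-by-row with the TypeSerializer while the
> part file is in progress, and only pushed through a BulkWriter (e.g. a
> Parquet writer) at publish time:
> {code:java}
> import org.apache.flink.api.common.serialization.BulkWriter;
> import org.apache.flink.api.common.typeutils.TypeSerializer;
> import org.apache.flink.core.memory.DataInputViewStreamWrapper;
> import org.apache.flink.core.memory.DataOutputViewStreamWrapper;
>
> import java.io.BufferedInputStream;
> import java.io.BufferedOutputStream;
> import java.io.EOFException;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.IOException;
>
> /**
>  * Hypothetical sketch of approach (1): stage raw records with the
>  * TypeSerializer while the part file is "in progress", and only push
>  * them through the columnar encoder when the file is published.
>  */
> class TwoPhaseWriter<T> {
>     private final TypeSerializer<T> serializer;
>     private final File stagingFile; // raw, row-encoded records
>     private final DataOutputViewStreamWrapper out;
>
>     TwoPhaseWriter(TypeSerializer<T> serializer, File stagingFile) throws IOException {
>         this.serializer = serializer;
>         this.stagingFile = stagingFile;
>         this.out = new DataOutputViewStreamWrapper(
>                 new BufferedOutputStream(new FileOutputStream(stagingFile)));
>     }
>
>     /** Phase 1: append a record in its raw TypeSerializer encoding. */
>     void addElement(T record) throws IOException {
>         serializer.serialize(record, out);
>     }
>
>     /** Phase 2: on rollover, re-read the staged records and bulk-encode them. */
>     void publish(BulkWriter<T> columnarWriter) throws IOException {
>         out.flush();
>         out.close();
>         try (DataInputViewStreamWrapper in = new DataInputViewStreamWrapper(
>                 new BufferedInputStream(new FileInputStream(stagingFile)))) {
>             while (true) {
>                 columnarWriter.addElement(serializer.deserialize(in));
>             }
>         } catch (EOFException endOfStagedRecords) {
>             // all staged records have been re-encoded
>         }
>         columnarWriter.finish();
>     }
> }
> {code}
> The staging file is row-encoded and cheap to flush, so persisting it on
> every checkpoint does not affect the size of the final columnar blocks.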
> Alternatively, we can support directly encoding the stream to the "in
> progress files" via Parquet/ORC, if users know that their combination of
> data rate and checkpoint interval will result in large enough chunks of data
> per checkpoint (see the example below).
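> A minimal sketch of what this direct encoding can look like with the
> StreamingFileSink bulk-format API, assuming the ParquetAvroWriters helper
> from flink-parquet; the wrapping class and method are illustrative:
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.flink.core.fs.Path;
> import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
>
> class DirectParquetSinkExample {
>     /**
>      * Encodes records straight to Parquet as they arrive. Bulk-encoded
>      * part files roll on checkpoint, so this only compresses well if
>      * enough data arrives per checkpoint interval.
>      */
>     static void attachSink(DataStream<GenericRecord> stream, Schema schema, String outputPath) {
>         StreamingFileSink<GenericRecord> sink = StreamingFileSink
>                 .forBulkFormat(new Path(outputPath),
>                         ParquetAvroWriters.forGenericRecord(schema))
>                 .build();
>         stream.addSink(sink);
>     }
> }
> {code}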