Stephan Ewen created FLINK-9753:
-----------------------------------
Summary: Support ORC/Parquet for StreamingFileSink
Key: FLINK-9753
URL: https://issues.apache.org/jira/browse/FLINK-9753
Project: Flink
Issue Type: Sub-task
Components: Streaming Connectors
Reporter: Stephan Ewen
Assignee: Kostas Kloudas
Fix For: 1.6.0
Formats like Parquet and ORC are great at compressing data and making it fast
to scan/filter/project the data.
However, these formats are only efficient if they can columnarize and compress
a significant amount of data at once. If they compress only a
few rows at a time, they produce many short column vectors and are thus much
less efficient.
The Bucketing Sink has the requirement that data is persisted to the target
FileSystem on each checkpoint.
Pushing data through a Parquet or ORC encoder and flushing on each checkpoint
means that for frequent checkpoints, the amount of data compressed/columnarized
in a block is small. Hence, the result is an inefficiently compressed file.
Making this efficient independently of the checkpoint interval would mean that
the sink needs to first collect (and persist) a good amount of data and then
push it through the Parquet/ORC writers.
I would suggest approaching this as follows:
- When writing to the "in progress files" write the raw records
(TypeSerializer encoding)
- When the "in progress file" is rolled over (published), the sink pushes the
data through the encoder.
- This is not much work on top of the new abstraction and will result in large
blocks and hence in efficient compression (see the sketch after this list).
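For illustration, here is a rough sketch of that two-phase idea. The
{{BulkEncoder}} interface is a hypothetical stand-in for a Parquet/ORC writer,
and the rollover loop is simplified; the actual interfaces of the new
StreamingFileSink abstraction may look different.
{code:java}
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputViewStreamWrapper;
import org.apache.flink.core.memory.DataOutputViewStreamWrapper;

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical columnar encoder, e.g. backed by a Parquet or ORC writer. */
interface BulkEncoder<T> {
    void addElement(T element) throws IOException;
    void finish() throws IOException;
}

/**
 * Phase 1: append raw records to the "in progress" file in the
 * TypeSerializer binary format. This is cheap and can be flushed/persisted
 * on every checkpoint without hurting columnar compression.
 */
class RawRecordWriter<T> {
    private final TypeSerializer<T> serializer;
    private final DataOutputViewStreamWrapper out;

    RawRecordWriter(TypeSerializer<T> serializer, OutputStream rawPartFile) {
        this.serializer = serializer;
        this.out = new DataOutputViewStreamWrapper(rawPartFile);
    }

    void write(T record) throws IOException {
        serializer.serialize(record, out);
    }
}

/**
 * Phase 2: when the in-progress file is rolled over (published), read the
 * raw records back and push the whole batch through the Parquet/ORC encoder,
 * producing one large, well-compressed block.
 */
class RolloverEncoder<T> {
    void encode(TypeSerializer<T> serializer,
                InputStream rawPartFile,
                BulkEncoder<T> bulkEncoder) throws IOException {
        DataInputViewStreamWrapper in = new DataInputViewStreamWrapper(rawPartFile);
        while (true) {
            T record;
            try {
                record = serializer.deserialize(in);
            } catch (EOFException endOfRawFile) {
                break; // all raw records consumed; real code would track counts/lengths
            }
            bulkEncoder.addElement(record);
        }
        bulkEncoder.finish(); // write footer / final column chunks
    }
}
{code}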
Alternatively, we can support directly encoding the stream to the "in progress
files" via Parquet/ORC, if users know that their combination of data rate and
checkpoint interval will result in large enough chunks of data per checkpoint
interval (see the second sketch below).
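A minimal sketch of that direct-encoding variant, again with a hypothetical
{{ParquetLikeEncoder}} standing in for the real Parquet/ORC writer. Flushing
the current row group / stripe on every checkpoint is exactly what makes this
variant sensitive to the data rate and checkpoint interval.
{code:java}
import java.io.IOException;

/** Hypothetical writer that streams records straight into columnar format. */
interface ParquetLikeEncoder<T> {
    void addElement(T element) throws IOException;
    /** Forces the current row group / stripe to disk, e.g. on checkpoint. */
    void flushBlock() throws IOException;
    void finish() throws IOException;
}

class DirectColumnarWriter<T> {
    private final ParquetLikeEncoder<T> encoder;

    DirectColumnarWriter(ParquetLikeEncoder<T> encoder) {
        this.encoder = encoder;
    }

    void write(T record) throws IOException {
        encoder.addElement(record);
    }

    /**
     * Called on checkpoint: everything buffered so far must be persisted,
     * which closes the current row group / stripe. With frequent checkpoints
     * and low data rates this yields small, poorly compressed blocks; hence
     * this path only makes sense for large enough chunks per checkpoint.
     */
    void persistOnCheckpoint() throws IOException {
        encoder.flushBlock();
    }
}
{code}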
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)