[
https://issues.apache.org/jira/browse/FLINK-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Metzger updated FLINK-13956:
-----------------------------------
Component/s: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Add sequence file format with repeated sync blocks
> --------------------------------------------------
>
> Key: FLINK-13956
> URL: https://issues.apache.org/jira/browse/FLINK-13956
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Reporter: Arvid Heise
> Priority: Minor
>
> The current {{SerializedOutputFormat}} produces files that are tightly bound
> to the block size of the file system. While that was a plausible assumption
> in the old HDFS days, it can lead to [hard-to-debug issues on other file
> systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E].
> We could implement a file format similar to the current version of Hadoop's
> {{SequenceFileFormat}}: insert a sync block between records whenever X bytes
> have been written. Hadoop uses a 2 kB interval, but I'd propose 1 MB.
--