[ https://issues.apache.org/jira/browse/FLINK-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated FLINK-13956:
--------------------------------
Description:
The current {{SerializedOutputFormat}} produces files that are tightly bound to the block size of the filesystem. While this was a somewhat plausible assumption in the old HDFS days, it can lead to [hard-to-debug issues in other file systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E]. We could implement a file format similar to the current version of Hadoop's SequenceFileFormat: add a sync block between records whenever X bytes have been written. Hadoop uses 2 KB, but I'd propose 1 MB.

was:
The current {{SequenceFileFormat}} produces files that are tightly bound to the block size of the filesystem. While this was a somewhat plausible assumption in the old HDFS days, it can lead to [hard to debug issues in other file systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E]. We could implement a file format similar to the current version of Hadoop's SequenceFileFormat: add a sync block inbetween records whenever X bytes were written. Hadoop uses 2k, but I'd propose to use 1M.

> Add sequence file format with repeated sync blocks
> --------------------------------------------------
>
>                 Key: FLINK-13956
>                 URL: https://issues.apache.org/jira/browse/FLINK-13956
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Arvid Heise
>            Priority: Minor
>
> The current {{SerializedOutputFormat}} produces files that are tightly bound
> to the block size of the filesystem. While this was a somewhat plausible
> assumption in the old HDFS days, it can lead to [hard-to-debug issues in
> other file
> systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E].
> We could implement a file format similar to the current version of Hadoop's
> SequenceFileFormat: add a sync block between records whenever X bytes have
> been written. Hadoop uses 2 KB, but I'd propose 1 MB.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
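To make the proposal concrete, the idea could be sketched as follows. This is a minimal illustration, not Flink's actual {{SerializedOutputFormat}} API: the class name, the length-prefixed record layout, and the `-1` length escape are all hypothetical choices for this sketch; only the sync-marker-every-N-bytes scheme and the 1 MB interval come from the issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.SecureRandom;

/**
 * Sketch of a writer that emits a 16-byte sync marker between records once
 * roughly SYNC_INTERVAL bytes have accumulated, so readers can re-align at
 * arbitrary split offsets independent of the filesystem block size.
 */
public class SyncBlockWriter {
    // Interval proposed in the issue: 1 MB (Hadoop's SequenceFile syncs ~every 2 KB).
    static final int SYNC_INTERVAL = 1024 * 1024;
    static final int SYNC_MARKER_SIZE = 16;

    private final DataOutputStream out;
    private final byte[] syncMarker = new byte[SYNC_MARKER_SIZE];
    private long bytesSinceSync = 0;

    public SyncBlockWriter(DataOutputStream out) throws IOException {
        this.out = out;
        new SecureRandom().nextBytes(syncMarker); // random per-file marker
        out.write(syncMarker);                    // file header: marker readers search for
    }

    public void writeRecord(byte[] record) throws IOException {
        if (bytesSinceSync >= SYNC_INTERVAL) {
            out.writeInt(-1);        // escape value in the length field signals a sync block
            out.write(syncMarker);
            bytesSinceSync = 0;
        }
        out.writeInt(record.length); // length-prefixed record payload
        out.write(record);
        bytesSinceSync += 4 + record.length;
    }
}
```

A reader scanning from an arbitrary offset would search for the 16-byte marker (read from the file header) and resume decoding length-prefixed records from there, which is essentially how Hadoop's SequenceFile recovers split boundaries.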