[ https://issues.apache.org/jira/browse/FLINK-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated FLINK-13956:
--------------------------------
Description:
The current {{SerializedOutputFormat}} produces files that are tightly bound to the block size of the filesystem. While this was a somewhat plausible assumption in the old HDFS days, it can lead to [hard-to-debug issues in other file systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E]. We could implement a file format similar to the current version of Hadoop's SequenceFileFormat: add a sync block between records whenever X bytes have been written. Hadoop uses 2 KB, but I'd propose 1 MB.

was:
The current {{SequenceFileFormat}} produces files that are tightly bound to the block size of the filesystem. While this was a somewhat plausible assumption in the old HDFS days, it can lead to [hard to debug issues in other file systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E]. We could implement a file format similar to the current version of Hadoop's SequenceFileFormat: add a sync block inbetween records whenever X bytes were written. Hadoop uses 2k, but I'd propose to use 1M.

> Add sequence file format with repeated sync blocks
> --------------------------------------------------
>
>                 Key: FLINK-13956
>                 URL: https://issues.apache.org/jira/browse/FLINK-13956
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Arvid Heise
>            Priority: Minor
>
> The current {{SerializedOutputFormat}} produces files that are tightly bound
> to the block size of the filesystem. While this was a somewhat plausible
> assumption in the old HDFS days, it can lead to [hard-to-debug issues in
> other file
> systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E].
> We could implement a file format similar to the current version of Hadoop's
> SequenceFileFormat: add a sync block between records whenever X bytes have
> been written. Hadoop uses 2 KB, but I'd propose 1 MB.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
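To make the proposal concrete, the idea could be sketched as follows. This is a minimal illustration, not Flink's actual {{SerializedOutputFormat}} API: the class name, the length-prefixed record layout, and the `-1` length escape are all hypothetical choices for this sketch; only the sync-marker-every-N-bytes scheme and the 1 MB interval come from the issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.SecureRandom;

/**
 * Sketch of a writer that emits a 16-byte sync marker between records once
 * roughly SYNC_INTERVAL bytes have accumulated, so readers can re-align at
 * arbitrary split offsets independent of the filesystem block size.
 */
public class SyncBlockWriter {
    // Interval proposed in the issue: 1 MB (Hadoop's SequenceFile syncs ~every 2 KB).
    static final int SYNC_INTERVAL = 1024 * 1024;
    static final int SYNC_MARKER_SIZE = 16;

    private final DataOutputStream out;
    private final byte[] syncMarker = new byte[SYNC_MARKER_SIZE];
    private long bytesSinceSync = 0;

    public SyncBlockWriter(DataOutputStream out) throws IOException {
        this.out = out;
        new SecureRandom().nextBytes(syncMarker); // random per-file marker
        out.write(syncMarker);                    // file header: marker readers search for
    }

    public void writeRecord(byte[] record) throws IOException {
        if (bytesSinceSync >= SYNC_INTERVAL) {
            out.writeInt(-1);        // escape value in the length field signals a sync block
            out.write(syncMarker);
            bytesSinceSync = 0;
        }
        out.writeInt(record.length); // length-prefixed record payload
        out.write(record);
        bytesSinceSync += 4 + record.length;
    }
}
```

A reader scanning from an arbitrary offset would search for the 16-byte marker (read from the file header) and resume decoding length-prefixed records from there, which is essentially how Hadoop's SequenceFile recovers split boundaries.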