[ https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Ivanfi resolved PARQUET-321.
-----------------------------------
Resolution: Fixed
Fix Version/s: 1.9.1
https://github.com/apache/parquet-mr/pull/391
> Set the HDFS padding default to 8MB
> -----------------------------------
>
> Key: PARQUET-321
> URL: https://issues.apache.org/jira/browse/PARQUET-321
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Fix For: 1.9.1
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS
> blocks and avoid remote reads. The ParquetFileWriter now either pads the
> remaining space in the block or targets a row group at the remaining size.
> The padding maximum is the threshold that decides between the two. If the
> space left in the block is under this threshold, it is padded. If it is
> greater than this threshold, the next row group is fit into the remaining
> space. The current padding maximum is 0.
>
> I think we should change the padding maximum to 8MB. My reasoning is this: we
> want this number to be small enough that it won't prevent the library from
> writing reasonable row groups, but larger than the minimum size row group we
> would want to write. 8MB is 1/16th of the default row group size (128MB), so
> I think it is reasonable: we don't want a row group to be smaller than 8MB.
>
> We also want this to be large enough that a few slightly undersized row
> groups in a block don't cause a tiny row group to be written in the leftover
> space. 8MB accounts for 4 row groups that are each 2MB under-size. In
> addition, it is reasonable to not allow row groups under 8MB.
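
The pad-or-fit decision described above can be sketched in a few lines of Java. This is an illustration only, not the parquet-mr implementation: the class PaddingSketch and its methods are hypothetical, and treating a remaining space exactly equal to the maximum as "pad" is an assumption (chosen because the issue says 8MB should cover 4 row groups that are 2MB short). The example assumes a 128MB HDFS block and a 32MB row group target.

```java
// Illustrative sketch; names are hypothetical and not taken from parquet-mr.
final class PaddingSketch {
    static final long MB = 1024L * 1024L;

    enum Action { PAD, FIT_ROW_GROUP }

    // Bytes left before the next HDFS block boundary.
    static long remainingInBlock(long position, long blockSize) {
        return blockSize - (position % blockSize);
    }

    // Pad when the leftover space is within the padding maximum (assumed
    // inclusive); otherwise fit the next row group into that space.
    static Action decide(long remaining, long maxPadding) {
        return remaining <= maxPadding ? Action.PAD : Action.FIT_ROW_GROUP;
    }

    // Target size for the next row group once the decision is applied.
    // Assumes rowGroupSize is no larger than the block size.
    static long nextRowGroupTarget(long remaining, long maxPadding, long rowGroupSize) {
        if (decide(remaining, maxPadding) == Action.PAD) {
            return rowGroupSize; // next row group starts on a block boundary
        }
        return Math.min(remaining, rowGroupSize);
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        long rowGroupSize = 32 * MB;
        // Four row groups, each 2MB under the 32MB target.
        long position = 4 * (rowGroupSize - 2 * MB);
        long remaining = remainingInBlock(position, blockSize);

        System.out.println("remaining MB: " + remaining / MB);
        System.out.println("max 0:   " + decide(remaining, 0)
                + ", next target MB: " + nextRowGroupTarget(remaining, 0, rowGroupSize) / MB);
        System.out.println("max 8MB: " + decide(remaining, 8 * MB)
                + ", next target MB: " + nextRowGroupTarget(remaining, 8 * MB, rowGroupSize) / MB);
    }
}
```

In this scenario 8MB is left in the block. With a padding maximum of 0 the sketch fits an 8MB row group into that space; with a maximum of 8MB it pads instead, and the next row group starts on a block boundary at the full 32MB target.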