[ https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824532#comment-15824532 ]
Zoltan Ivanfi edited comment on PARQUET-321 at 1/16/17 8:12 PM:
----------------------------------------------------------------

Sorry, closed by mistake.

was (Author: zi):
Fixed in https://github.com/apache/parquet-mr/pull/391

> Set the HDFS padding default to 8MB
> -----------------------------------
>
>                 Key: PARQUET-321
>                 URL: https://issues.apache.org/jira/browse/PARQUET-321
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>             Fix For: 1.9.1
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS
> blocks to avoid remote reads. The ParquetFileWriter will now either pad the
> remaining space in the block or target a row group for the remaining size.
> The padding maximum controls the threshold of the amount of padding that will
> be used. If the space left is under this threshold, it is padded. If it is
> greater than this threshold, then the next row group is fit into the
> remaining space. The current padding maximum is 0.
> I think we should change the padding maximum to 8MB. My reasoning is this: we
> want this number to be small enough that it won't prevent the library from
> writing reasonable row groups, but larger than the minimum size row group we
> would want to write. 8MB is 1/16th of the row group default, so I think it is
> reasonable: we don't want a row group to be smaller than 8 MB.
> We also want this to be large enough that a few row groups in a block don't
> cause a tiny row group to be written in the excess space. 8MB accounts for 4
> row groups that are 2MB under-size. In addition, it is reasonable to not
> allow row groups under 8MB.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
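
The pad-or-fit decision described in the issue can be sketched roughly as follows. This is an illustrative Python model, not the parquet-mr implementation: the function name plan_next_row_group and its parameters are hypothetical, the 128MB row group default is inferred from "8MB is 1/16th of the row group default", and treating "space left equals the threshold" as padding is an assumption, since the issue only covers the under and over cases.

```python
MB = 1024 * 1024


def plan_next_row_group(position, block_size=128 * MB,
                        row_group_size=128 * MB, max_padding=8 * MB):
    """Return (padding_bytes, target_row_group_bytes) for the next row group.

    Hypothetical helper: `position` is the current write offset in the file.
    """
    remaining = block_size - (position % block_size)
    if remaining == block_size:
        # Already aligned to a block boundary: nothing to pad.
        return 0, row_group_size
    if remaining <= max_padding:
        # Small leftover: pad to the block boundary, then write a full row group.
        # (Using <= for the equal case is an assumption of this sketch.)
        return remaining, row_group_size
    # Large leftover: fit the next row group into the remaining space.
    return 0, min(remaining, row_group_size)


# Four row groups that each came out 2MB under-size in a 512MB block
# leave 8MB of excess space.
position = 4 * 126 * MB

# With an 8MB padding maximum the excess is padded away.
print(plan_next_row_group(position, block_size=512 * MB))

# With the old maximum of 0 the writer would target a tiny 8MB row group.
print(plan_next_row_group(position, block_size=512 * MB, max_padding=0))
```

The 512MB block size in the usage lines is chosen only so that four under-size row groups fit in a single block; it is not taken from the issue.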