[ 
https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824532#comment-15824532
 ] 

Zoltan Ivanfi edited comment on PARQUET-321 at 1/16/17 8:12 PM:
----------------------------------------------------------------

Sorry, closed by mistake.


was (Author: zi):
Fixed in https://github.com/apache/parquet-mr/pull/391

> Set the HDFS padding default to 8MB
> -----------------------------------
>
>                 Key: PARQUET-321
>                 URL: https://issues.apache.org/jira/browse/PARQUET-321
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>             Fix For: 1.9.1
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS 
> blocks to avoid remote reads. The ParquetFileWriter will now either pad the 
> remaining space in the block or target a row group for the remaining size.
> The padding maximum is the largest amount of padding the writer will add. If 
> the space left in the block is under this threshold, it is padded. If it is 
> greater than this threshold, then the next row group is fit into the 
> remaining space. The current padding maximum is 0.
> I think we should change the padding maximum to 8MB. My reasoning is this: we 
> want this number to be small enough that it won't prevent the library from 
> writing reasonable row groups, but larger than the minimum size row group we 
> would want to write. 8MB is 1/16th of the row group default, so I think it is 
> reasonable: we don't want a row group to be smaller than 8MB.
> We also want this to be large enough that a few row groups in a block don't 
> cause a tiny row group to be written in the excess space. 8MB covers, for 
> example, 4 row groups that each come in 2MB under size. In addition, it is 
> reasonable to not allow row groups under 8MB.
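
The padding rule described above can be sketched roughly as follows. This is
an illustration only, not the ParquetFileWriter code: the function name
plan_next_row_group and its parameters are hypothetical, the 128MB row group
default is inferred from "8MB is 1/16th of the row group default", and
treating "space left equals the maximum" as a padding case is an assumption,
since the description only covers "under" and "greater than".

```python
MB = 1024 * 1024

# Inferred from the description: 8MB is 1/16th of the row group default.
DEFAULT_ROW_GROUP = 128 * MB


def plan_next_row_group(remaining_in_block, max_padding,
                        row_group_size=DEFAULT_ROW_GROUP):
    """Return (padding_bytes, target_row_group_bytes).

    Hypothetical helper mirroring the rule in the issue text:
    - space left at or under the padding maximum: pad it out, so the
      next row group starts on a fresh block at full size;
    - otherwise: no padding, and the next row group is sized to fit
      into the space that is left.
    """
    if remaining_in_block <= max_padding:  # equality case is an assumption
        return remaining_in_block, row_group_size
    return 0, min(row_group_size, remaining_in_block)


# With the current maximum of 0, 6MB of leftover space becomes a 6MB row group.
tiny = plan_next_row_group(6 * MB, max_padding=0)

# With the proposed 8MB maximum, the same 6MB is padded instead.
padded = plan_next_row_group(6 * MB, max_padding=8 * MB)

# Four row groups that are each 2MB under size leave exactly 8MB over.
leftover = 4 * 2 * MB
covered = plan_next_row_group(leftover, max_padding=8 * MB)
```

Under these assumptions, the 8MB value is the point where the leftover space
from four slightly undersized row groups is still padded rather than turned
into a row group smaller than 8MB.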



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
