Impala defaults to 64KB:
https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83

I think larger pages probably have slightly less runtime and encoding
overhead, since there are fewer page boundaries to handle, but they consume
more memory and may be less cache-efficient. I guess if you have 8KB pages
you can fit several pages in the L1 cache of a typical Intel processor
(64KB total, of which 32KB is data cache), which may help with performance.
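
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only uses the sizes quoted in this thread; the 64KB L1 figure is an assumption that varies by CPU, and the helper names are made up for illustration. It ignores everything else that competes for cache space, so treat it as arithmetic rather than a performance model.

```python
KB = 1024

# Default or recommended page sizes quoted in this thread.
PAGE_SIZES = {
    "parquet-format docs": 8 * KB,
    "Impala": 64 * KB,
    "parquet-mr": 1024 * 1024,
}

# Assumption: the 64KB L1 figure mentioned above; adjust for the CPU at hand.
L1_CACHE_BYTES = 64 * KB


def pages_per_cache(page_bytes, cache_bytes=L1_CACHE_BYTES):
    """Whole pages that fit in the cache, ignoring any other cached data."""
    return cache_bytes // page_bytes


def ratio_to_recommended(page_bytes, recommended=8 * KB):
    """How many times larger a page size is than the documented 8KB."""
    return page_bytes // recommended


for name, size in PAGE_SIZES.items():
    print(f"{name}: {size // KB}KB, "
          f"{pages_per_cache(size)} whole page(s) in L1, "
          f"{ratio_to_recommended(size)}x the recommended size")
```

With these inputs, eight 8KB pages fit in 64KB, exactly one Impala page does, and a 1MB parquet-mr page does not fit at all.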

I'd be interested to know how the parquet-mr value was arrived at too.

On Mon, Jan 8, 2018 at 8:51 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> I noticed the following note regarding page sizes in the Parquet Format
> documentation <https://github.com/apache/parquet-format#configurations>:
> "We recommend 8KB for page sizes."
>
> In the Java implementation
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L46>,
> however, we have a default page size that is 128 times larger: "public
> static final int DEFAULT_PAGE_SIZE = 1024 * 1024;"
>
> Does anyone know the reason behind this? Should we update the docs? What
> default values do other implementations use?
>
> Thanks,
>
> Zoltan
>
