Impala defaults to 64KB:
https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83

I think larger pages probably have slightly less runtime and encoding overhead associated with handling page boundaries, but consume more memory and may be less cache-efficient. I guess if you have 8KB pages you can fit several pages in the L1 cache of a typical Intel processor (64KB), which may help with performance.

I'd be interested to know how the parquet-mr value was arrived at too.

On Mon, Jan 8, 2018 at 8:51 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> I noticed the following note regarding page sizes in the Parquet Format
> documentation <https://github.com/apache/parquet-format#configurations>:
> "We recommend 8KB for page sizes."
>
> In the Java implementation
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L46>,
> however, we have a default page size that is 128 times larger: "public
> static final int DEFAULT_PAGE_SIZE = 1024 * 1024;"
>
> Does anyone know the reason behind this? Should we update the docs? What
> default values do other implementation use?
>
> Thanks,
>
> Zoltan
>
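
As a quick sanity check on the numbers in this thread, here is a minimal Python sketch of the arithmetic. The constant and function names are hypothetical, chosen only for illustration. The 64KB L1 figure is the one used in the reply above; the 32KB value is an assumption for cores whose L1 data cache is that size. This only counts how many whole pages fit in a cache of a given size and says nothing about measured performance.

```python
KB = 1024

# Page sizes mentioned in this thread.
FORMAT_DOC_PAGE_SIZE = 8 * KB        # "We recommend 8KB for page sizes."
IMPALA_PAGE_SIZE = 64 * KB           # Impala default cited in the reply
PARQUET_MR_PAGE_SIZE = 1024 * 1024   # DEFAULT_PAGE_SIZE in parquet-mr


def pages_that_fit(cache_bytes: int, page_bytes: int) -> int:
    """Return how many whole pages of page_bytes fit in cache_bytes."""
    return cache_bytes // page_bytes


# Ratio between the parquet-mr default and the documented recommendation.
ratio = PARQUET_MR_PAGE_SIZE // FORMAT_DOC_PAGE_SIZE

if __name__ == "__main__":
    print("parquet-mr default / documented recommendation:", ratio)
    # 64KB is the figure from the reply; 32KB is an assumed alternative.
    for cache in (64 * KB, 32 * KB):
        for name, page in (
            ("8KB", FORMAT_DOC_PAGE_SIZE),
            ("64KB", IMPALA_PAGE_SIZE),
            ("1MB", PARQUET_MR_PAGE_SIZE),
        ):
            fit = pages_that_fit(cache, page)
            print(f"{cache // KB}KB cache, {name} pages: {fit} whole page(s)")
```

With a 64KB cache this gives eight 8KB pages, one 64KB page, and no complete 1MB page, which is the intuition behind the cache-efficiency remark.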
