Hi,

I raised this topic in yesterday's Parquet Sync and learned that parquet-mr's 1MB page size was chosen because it provides very good compression ratios. However, it is suboptimal for page filtering, which will become increasingly important with the introduction of column indexes. For this reason (and also for consistency between the documentation and the implementations), we should come up with a new recommended/default value. I created PARQUET-1190 <https://issues.apache.org/jira/browse/PARQUET-1190> to track this effort.
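To make the filtering granularity concrete, here is a rough back-of-the-envelope sketch. It only uses the three page sizes quoted in this thread; the 16MB column chunk size and the `pagesPerChunk` helper are assumptions made up for illustration and are not part of parquet-mr or any other implementation.

```java
public class PageSizeComparison {
    // Page sizes quoted in this thread.
    static final int DOC_RECOMMENDED_PAGE_SIZE = 8 * 1024;        // parquet-format documentation
    static final int IMPALA_DEFAULT_PAGE_SIZE = 64 * 1024;        // Impala writer default
    static final int PARQUET_MR_DEFAULT_PAGE_SIZE = 1024 * 1024;  // ParquetProperties.DEFAULT_PAGE_SIZE

    // Hypothetical helper: number of pages needed to hold chunkBytes of data
    // when every page holds at most pageSize bytes (ceiling division).
    static long pagesPerChunk(long chunkBytes, int pageSize) {
        return (chunkBytes + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // Assumed column chunk size, chosen only for illustration.
        long chunkBytes = 16L * 1024 * 1024;

        int[] pageSizes = {
            DOC_RECOMMENDED_PAGE_SIZE,
            IMPALA_DEFAULT_PAGE_SIZE,
            PARQUET_MR_DEFAULT_PAGE_SIZE
        };
        for (int pageSize : pageSizes) {
            // More pages per chunk means a page-level filter can skip data
            // in smaller units.
            System.out.printf("page size %d bytes: %d pages per chunk%n",
                    pageSize, pagesPerChunk(chunkBytes, pageSize));
        }

        // Ratio between the parquet-mr default and the documented recommendation.
        System.out.printf("parquet-mr default / documented recommendation = %d%n",
                PARQUET_MR_DEFAULT_PAGE_SIZE / DOC_RECOMMENDED_PAGE_SIZE);
    }
}
```

Under these assumptions, a page-level filter works with far fewer, far larger units at 1MB than at 8KB, so each page it cannot skip costs more to read. The sketch says nothing about compression ratio or per-page overhead, which pull in the opposite direction.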
Br,

Zoltan

On Tue, Jan 9, 2018 at 10:50 PM Tim Armstrong <[email protected]> wrote:

> Impala defaults to 64KB:
>
> https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83
>
> I think larger pages probably have slightly less runtime and encoding
> overhead associated with handling page boundaries, but consume more memory
> and may be less cache-efficient. I guess if you have 8KB pages you can fit
> several pages in the L1 cache of a typical Intel processor (64KB), which
> may help with performance.
>
> I'd be interested to know how the parquet-mr value was arrived at too.
>
> On Mon, Jan 8, 2018 at 8:51 AM, Zoltan Ivanfi <[email protected]> wrote:
>
> > Hi,
> >
> > I noticed the following note regarding page sizes in the Parquet Format
> > documentation <https://github.com/apache/parquet-format#configurations>:
> > "We recommend 8KB for page sizes."
> >
> > In the Java implementation
> > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L46>,
> > however, we have a default page size that is 128 times larger: "public
> > static final int DEFAULT_PAGE_SIZE = 1024 * 1024;"
> >
> > Does anyone know the reason behind this? Should we update the docs? What
> > default values do other implementations use?
> >
> > Thanks,
> >
> > Zoltan
