Optimizing compression ratios is one issue, and optimizing page granularity for column indexes is another. A third issue is that the Parquet footer holds per-page metadata in Thrift format, and it has to be interpreted before anything in the file can be accessed. Too many pages could slow down file opening, and if you're only interested in a few columns, reading per-page metadata for columns you don't care about could dominate the cost of opening the file.
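
As a rough back-of-the-envelope sketch (not a measurement), the page counts implied by the three page sizes mentioned in this thread can be compared directly. The 1 GiB data volume and the `page_count` helper are assumptions made up for illustration. The sketch ignores compression, encoding and row-group boundaries, and only shows how the number of pages, and with it the amount of per-page bookkeeping, scales with page size.

```python
# Page sizes mentioned in this thread, in bytes.
PAGE_SIZES = {
    "parquet-format docs (8KB)": 8 * 1024,
    "Impala default (64KB)": 64 * 1024,
    "parquet-mr default (1MB)": 1024 * 1024,
}

# Hypothetical workload: 1 GiB of column data in total (an assumption).
DATA_BYTES = 1024 ** 3


def page_count(data_bytes, page_size):
    """Pages needed to hold data_bytes at page_size (integer ceiling)."""
    return -(-data_bytes // page_size)


for label, size in PAGE_SIZES.items():
    print(f"{label}: {page_count(DATA_BYTES, size):,} pages")

# Prints:
# parquet-format docs (8KB): 131,072 pages
# Impala default (64KB): 16,384 pages
# parquet-mr default (1MB): 1,024 pages
```

Under these assumptions, 8KB pages produce 128 times as many pages as 1MB pages for the same data, so any cost that is paid once per page grows by the same factor.
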
(I recently worked with a file that had too-small row groups and therefore too much metadata; just *opening* the file in parquet-python took over a minute because all of the Thrift data was being interpreted by pure Python.)

-- Jim

On Wed, Jan 10, 2018 at 8:19 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> I raised this topic in yesterday's Parquet Sync and I learned that the 1MB
> page size of parquet-mr was selected because it provides very good
> compression ratios. However, it is suboptimal for page filtering, which
> will become increasingly important with the introduction of column indexes.
> For this reason (and also for consistency), we should come up with a new
> recommended/default value. I created PARQUET-1190
> <https://issues.apache.org/jira/browse/PARQUET-1190> to track this effort.
>
> Br,
>
> Zoltan
>
> On Tue, Jan 9, 2018 at 10:50 PM Tim Armstrong <[email protected]>
> wrote:
>
> > Impala defaults to 64kb:
> >
> > https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83
> >
> > I think larger pages probably have slightly less runtime and encoding
> > overhead associated with handling page boundaries, but consume more memory
> > and may be less cache-efficient. I guess if you have 8KB pages you can fit
> > several pages in the L1 cache of a typical Intel processor (64kb), which
> > may help with performance.
> >
> > I'd be interested to know how the parquet-mr value was arrived at too.
> >
> > On Mon, Jan 8, 2018 at 8:51 AM, Zoltan Ivanfi <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I noticed the following note regarding page sizes in the Parquet Format
> > > documentation <https://github.com/apache/parquet-format#configurations>:
> > > "We recommend 8KB for page sizes."
> > >
> > > In the Java implementation
> > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L46>,
> > > however, we have a default page size that is 128 times larger: "public
> > > static final int DEFAULT_PAGE_SIZE = 1024 * 1024;"
> > >
> > > Does anyone know the reason behind this? Should we update the docs? What
> > > default values do other implementation use?
> > >
> > > Thanks,
> > >
> > > Zoltan
