Optimizing compression ratios is one issue and optimizing page granularity
for column indexes is another. A third is that the Parquet footer contains
per-page metadata in Thrift format, which has to be interpreted before
anything in the file can be accessed. Too many pages could slow down opening
the file, and if you're only interested in a few columns, reading per-page
metadata for columns you don't care about could dominate the total read time.

(I recently worked with a file that had too-small row groups and therefore
too much metadata; just *opening* the file in parquet-python took over a
minute because all of the Thrift data was being interpreted by pure Python.)
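
To give a feel for how quickly the amount of metadata grows, here is a rough
counting sketch. It is a back-of-the-envelope model, not a measurement: the
function name and its parameters are hypothetical, it assumes every row group
is full-sized and that each column takes an equal share of a row group, and
how much of this metadata a reader actually has to decode when opening a file
depends on the writer and the format features in use.

```python
import math


def estimate_metadata_entries(file_bytes, row_group_bytes, page_bytes, num_columns):
    """Rough count of metadata entries in a file (hypothetical helper).

    Assumes full-sized row groups and an equal share of each row group
    per column; real files deviate from both assumptions.
    """
    row_groups = math.ceil(file_bytes / row_group_bytes)
    # One column-chunk record per column per row group.
    column_chunks = row_groups * num_columns
    # Bytes of a single column inside one row group (equal-share assumption).
    chunk_bytes = row_group_bytes / num_columns
    pages_per_chunk = max(1, math.ceil(chunk_bytes / page_bytes))
    return {
        "row_groups": row_groups,
        "column_chunks": column_chunks,
        "pages": column_chunks * pages_per_chunk,
    }


KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# A 1 GiB file with 100 columns, comparing large and too-small row groups.
large = estimate_metadata_entries(GIB, 128 * MIB, 1 * MIB, 100)
small = estimate_metadata_entries(GIB, 1 * MIB, 8 * KIB, 100)
print(large["column_chunks"])  # 800
print(small["column_chunks"])  # 102400
```

Under these assumptions, shrinking the row groups from 128 MiB to 1 MiB
multiplies the number of column-chunk records by 128, and all of them sit in
front of the first byte of data you actually want.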

-- Jim

On Wed, Jan 10, 2018 at 8:19 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> I raised this topic in yesterday's Parquet Sync and I learned that the 1MB
> page size of parquet-mr was selected because it provides very good
> compression ratios. However, it is suboptimal for page filtering, which
> will become increasingly important with the introduction of column indexes.
> For this reason (and also for consistency), we should come up with a new
> recommended/default value. I created PARQUET-1190
> <https://issues.apache.org/jira/browse/PARQUET-1190> to track this effort.
>
> Br,
>
> Zoltan
>
> On Tue, Jan 9, 2018 at 10:50 PM Tim Armstrong <[email protected]>
> wrote:
>
> > Impala defaults to 64KB:
> >
> > https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83
> >
> > I think larger pages probably have slightly less runtime and encoding
> > overhead associated with handling page boundaries, but consume more memory
> > and may be less cache-efficient. I guess if you have 8KB pages you can fit
> > several pages in the L1 cache of a typical Intel processor (64KB), which
> > may help with performance.
> >
> > I'd be interested to know how the parquet-mr value was arrived at too.
> >
> > On Mon, Jan 8, 2018 at 8:51 AM, Zoltan Ivanfi <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I noticed the following note regarding page sizes in the Parquet Format
> > > > documentation <https://github.com/apache/parquet-format#configurations>:
> > > "We recommend 8KB for page sizes."
> > >
> > > In the Java implementation
> > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L46>,
> > > however, we have a default page size that is 128 times larger: "public
> > > static final int DEFAULT_PAGE_SIZE = 1024 * 1024;"
> > >
> > > > Does anyone know the reason behind this? Should we update the docs? What
> > > > default values do other implementations use?
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> >
>
