Hi,

If you use HDFS, then the row group size should match the HDFS block size,
otherwise data locality (and thus performance) will suffer.
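
For example, with the parquet-mr Avro writer one could size the row groups
from the actual HDFS block size of the target path roughly like this (just a
sketch; the openWriter helper and the schema handling are made up for
illustration):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class RowGroupSizing {
  // Hypothetical helper: open a writer whose row group size matches the
  // HDFS block size of the target path, so each row group stays on one block.
  static ParquetWriter<GenericRecord> openWriter(Schema schema, Path out)
      throws Exception {
    Configuration conf = new Configuration();
    long hdfsBlockSize = FileSystem.get(conf).getDefaultBlockSize(out);

    return AvroParquetWriter.<GenericRecord>builder(out)
        .withConf(conf)
        .withSchema(schema)
        .withRowGroupSize((int) hdfsBlockSize) // typically 128 MB
        .build();
  }
}

In MapReduce/Hive/Spark jobs the same knob is exposed as the
parquet.block.size property.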

Regarding page size, in general larger pages lead to smaller files. On the
other hand, the page-level metadata may include min and max values that can
be used to skip entire pages when looking for specific values that do not
fall into their min-max range. With larger pages this page skipping becomes
less fine-grained, so in the end more data may have to be deserialized.
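
If you want to experiment with this trade-off, the page size can be set on
the same builder (again just a sketch reusing the imports from above; the
64 KB value is only an example to try, not a recommendation; the parquet-mr
default is 1 MB):

static ParquetWriter<GenericRecord> openWriterWithSmallPages(
    Schema schema, Path out, Configuration conf) throws Exception {
  // Smaller pages -> finer-grained min/max pruning but slightly larger files;
  // larger pages -> smaller files but coarser skipping.
  return AvroParquetWriter.<GenericRecord>builder(out)
      .withConf(conf)
      .withSchema(schema)
      .withRowGroupSize(128 * 1024 * 1024) // 128 MB row groups
      .withPageSize(64 * 1024)             // 64 KB pages; default is 1 MB
      .build();
}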

Zoltan

On Fri, Jan 12, 2018 at 10:19 PM Ryan Blue <[email protected]>
wrote:

> I recommend trying different values using the parquet-cli. That's an easy
> way to see how different row group and page sizes perform. That's what I do
> to tune all of our tables.
>
> rb
>
> On Fri, Jan 12, 2018 at 10:43 AM, ALeX Wang <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using parquet to store a big table (400+ columns), and most of the
> > columns will be null.
> >
> > Is there a recommended row group size and number of row groups per
> > parquet file for my use case? Or is there any reference/paper that I
> > could read myself?
> >
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
