Hi,

If you use HDFS, the row group size should match the HDFS block size; otherwise data locality (and thus performance) will suffer.
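For example, with parquet-mr on Hadoop the row group size is controlled by the parquet.block.size setting. The snippet below is only a sketch: it assumes you configure the writer through a Hadoop Configuration and that the cluster exposes the usual dfs.blocksize key; the class and variable names are made up for illustration.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: align the Parquet row group size with the HDFS block size.
    public class RowGroupSizing {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // The cluster's HDFS block size (fall back to 128 MB if unset).
        long hdfsBlockSize = conf.getLongBytes("dfs.blocksize", 128L * 1024 * 1024);

        // parquet.block.size is the row group size used by parquet-mr writers.
        // Making it equal to the HDFS block size keeps each row group on a
        // single datanode, so readers keep data locality.
        conf.setLong("parquet.block.size", hdfsBlockSize);

        System.out.println("row group size = " + hdfsBlockSize + " bytes");
      }
    }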
Regarding page size, larger pages generally lead to smaller files. On the other hand, the page-level metadata may include min and max values that can be used to skip entire pages when looking for specific values that do not fall within a page's min-max range. With larger pages this skipping becomes less fine-grained, so in the end more data may have to be deserialized. (A small configuration sketch of this trade-off follows after the quoted thread below.)

Zoltan

On Fri, Jan 12, 2018 at 10:19 PM Ryan Blue <[email protected]> wrote:

> I recommend trying different values using the parquet-cli. That's an easy
> way to see how different row group and page sizes perform. That's what I
> do to tune all of our tables.
>
> rb
>
> On Fri, Jan 12, 2018 at 10:43 AM, ALeX Wang <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using parquet to store a big table (400+ columns), and most of the
> > columns will be none.
> >
> > Is there any recommended row group size and number of row groups per
> > parquet file for my use case? Or is there any reference/paper that I
> > could read myself?
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
>
> --
> Ryan Blue
> Software Engineer
> Netflix
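P.S. Here is a similar sketch of the page-size trade-off discussed above, again assuming the parquet-mr parquet.page.size key (the writer default is 1 MB); the two sizes below are illustrative, not recommendations.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: page size sets the granularity of page-level min/max statistics.
    public class PageSizing {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Larger pages: less per-page metadata, better compression, smaller files.
        // Smaller pages: finer-grained min/max statistics, so selective reads
        // can skip more pages instead of deserializing them.
        int bulkScanPageSize = 1024 * 1024; // the 1 MB default, good for full scans
        int lookupPageSize   = 64 * 1024;   // smaller pages for selective lookups

        // Pick one depending on the dominant read pattern.
        conf.setInt("parquet.page.size", lookupPageSize);

        System.out.println("page size = " + conf.getInt("parquet.page.size", bulkScanPageSize));
      }
    }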
