I can take your comment two ways: what is the downside of large pages, or what is the downside of small row groups?
One of the key considerations I've dealt with is that the page is the unit of compression and, if I recall correctly, Parquet uses block rather than stream compression. This means you typically need enough memory to hold at least one page per column at a time. Bigger pages increase this requirement. With many concurrent cores and wide tables (hundreds or thousands of columns), this can be a substantial amount of memory, so you don't want pages to be too large in these cases.

For row groups, being able to read only select columns can improve I/O. However, the smaller the column chunk (all pages for a column) in a row group, the less benefit one can gain by avoiding I/O. This is true for traditional distributed file systems as well as cloud stores (in either case you want to amortize the access time across a large enough transfer).

So yes, the column indexes help the performance of pruning pages, which are typically one to two orders of magnitude smaller than column chunks. One key advantage of column indexes, even when you fail to avoid I/O time, is that you avoid decompressing skippable pages. In fact, an early version of column indexes we used simply duplicated the page headers in the footer.

On Sat, Mar 19, 2022, 4:42 AM Jorge Cardoso Leitão wrote:

> Hi,
>
> I am trying to understand the benefits of using multiple data pages and
> indexes vs multiple row groups.
>
> Some basics first:
>
> row groups ensures that a sequence of rows are "aligned" at the group
> boundary independently of how they are divided in pages:
>
> row group 1:
> c1: |--p11--|--p12--|---p13---|
> c2: |--p21--|-------p22-------|
> c3: |-p31-|-p32-|-----p23-----|
> ...
>
> rows in a page are encoded and compressed together. The more rows in a
> page, the higher the encoding/compression potential.
>
> There is a tradeoff between selectivity potential and encoding/compression
> potential, in that the more row groups/page boundaries there are, the
> higher the selectivity potential, but the lower the encoding/compression
> potential, and vice-versa.
>
> We currently have 4 methods to perform filter pushdown:
> * row group selection via row group statistics
> * row group selection via bloom filters
> * page selection via page statistics (deprecated)
> * page selection via page indexes and column indexes.
>
> Page indexes and column indexes store page locations and page statistics,
> without having to read the pages sequentially (and without having to read
> the page header).
>
> I am wondering why page selection is better than row group selection in
> that why should we create multiple pages per row group instead of writing
> more row groups with a single data page.
>
> Specifically, given a sequence of data pages from which we identified that
> only a subset is valid, say in the example above a filter in column c3
> selected `p32`, to apply the filter across columns, we need to select the
> pages from all columns that are part of the interval of rows from which
> that page belongs to:
>
> row group 1:
> c1: |--p11--|--p12--|---p13---|
> c2: |--p21--|-------p22-------|
> c3: |-p31-|-p32-|-----p23-----|
>            --I--
>
> which, in this example is:
>
> c1: |--p11--|--p12--|
> c2: |--p21--|-------p22-------|
> c3: |-p31-|-p32-|-----p23-----|
>            --I--
>
> (i.e. we still had to load most of the data due to non-alignment between
> pages).
>
> My thinking here is that If instead of creating multiple data pages per row
> group, we could have created multiple row groups with a single data page:
>
> row group 1:
> c1: |--p11--|
> c2: |--p21--|
> c3: |--p31--|
> row group 2:
> c1: |--p12--|
> c2: |--p21--|
> c3: |--p32--|
> ...
>
> and the filter would be applied at the row group level, i.e. by only
> selecting row group 2. I.e.
> the fact that row group boundaries enforce a
> row alignment seems quite important for filter pushdown, as it minimizes
> data load across columns.
>
> Is it the idea that pages solve a different purpose, and column and page
> indexes are used to improve the filter pushdown potential? If yes, what is
> the purpose of having multiple pages in a single row group? I could not
> find documentation or guidelines on this. I remember reading somewhere that
> most writers have a default number of bytes per page optimal for HDFS for
> optimal retrieval(?), but e.g. in distributed blob storage, these are not
> very relevant afaik(?).
>
> Any guidance would be very much appreciated.
>
> Best,
> Jorge
>
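
To make the alignment point in the quoted example concrete, here is a rough sketch of how a reader could map the row interval of one selected page onto the pages of the other columns. It is not taken from any Parquet implementation: the row counts, the `FIRST_ROWS` table and both helper functions are made up for illustration. The only thing it assumes from the format is that the offset index records the first row index of each page.

```python
from bisect import bisect_right

def row_span(first_rows, num_rows, page):
    """Half-open row interval [start, end) covered by one page."""
    start = first_rows[page]
    end = first_rows[page + 1] if page + 1 < len(first_rows) else num_rows
    return start, end

def pages_overlapping(first_rows, num_rows, start, end):
    """Indexes of the pages whose rows intersect [start, end).

    first_rows: sorted first-row index of every page in one column chunk
    num_rows:   total number of rows in the row group
    """
    if not (0 <= start < end <= num_rows):
        raise ValueError("row range must lie inside the row group")
    first = bisect_right(first_rows, start) - 1    # page containing `start`
    last = bisect_right(first_rows, end - 1) - 1   # page containing `end - 1`
    return list(range(first, last + 1))

# Hypothetical layout loosely following the diagram: 100 rows in the
# row group, with page boundaries that do not line up across columns.
NUM_ROWS = 100
FIRST_ROWS = {
    "c1": [0, 30, 60],  # p11, p12, p13
    "c2": [0, 30],      # p21, p22
    "c3": [0, 20, 40],  # p31, p32, third page
}

# A filter on c3 keeps only its second page (p32).
start, end = row_span(FIRST_ROWS["c3"], NUM_ROWS, 1)
needed = {
    col: pages_overlapping(rows, NUM_ROWS, start, end)
    for col, rows in FIRST_ROWS.items()
}
print(needed)  # {'c1': [0, 1], 'c2': [0, 1], 'c3': [1]}
```

With these made-up boundaries, keeping one page of c3 still pulls in two pages of c1 and both pages of c2, which is the non-alignment cost described in the question. Even then, the pages that fall outside the interval (p13 here) never need to be decompressed.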
