Hi,
I am trying to understand the benefits of using multiple data pages and
indexes vs multiple row groups.
Some basics first:
Row groups ensure that a sequence of rows is "aligned" at the group
boundary, independently of how the rows are divided into pages:
row group 1:
c1: |--p11--|--p12--|---p13---|
c2: |--p21--|-------p22-------|
c3: |-p31-|-p32-|-----p33-----|
...
Rows in a page are encoded and compressed together. The more rows in a
page, the higher the encoding/compression potential.
There is a tradeoff between selectivity potential and encoding/compression
potential, in that the more row groups/page boundaries there are, the
higher the selectivity potential, but the lower the encoding/compression
potential, and vice-versa.
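To make the compression side of this tradeoff concrete, here is a rough sketch. It uses zlib as a stand-in for a Parquet codec, and the helper `compressed_size` and the synthetic column are made up for illustration, so the numbers only show the direction of the effect:

```python
import struct
import zlib


def compressed_size(values, rows_per_page):
    """Total compressed bytes when `values` is split into pages of `rows_per_page` rows."""
    total = 0
    for start in range(0, len(values), rows_per_page):
        page = values[start:start + rows_per_page]
        # Plain little-endian int32 encoding, then compress each page on its own.
        raw = struct.pack(f"<{len(page)}i", *page)
        total += len(zlib.compress(raw))
    return total


# Synthetic, highly repetitive column (an assumption, not real data).
column = [i % 50 for i in range(100_000)]
one_page = compressed_size(column, rows_per_page=100_000)
many_pages = compressed_size(column, rows_per_page=1_000)
print(one_page, many_pages)
```

Each page is compressed independently, so every extra boundary pays the codec's fixed overhead again and loses the context built up from earlier rows.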
We currently have 4 methods to perform filter pushdown:
* row group selection via row group statistics
* row group selection via bloom filters
* page selection via page statistics (deprecated)
* page selection via column indexes and offset indexes (the page index).
Column indexes store per-page statistics and offset indexes store page
locations, so both can be consulted without having to read the pages
sequentially (and without having to read the page headers).
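As I understand it, page selection from these per-page statistics boils down to something like the sketch below. `PageStats` and `pages_matching_eq` are hypothetical names, the min/max values are invented, and this is not how any particular reader implements it:

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    # Hypothetical, simplified view of what the indexes expose for one page.
    first_row: int   # index of the page's first row within the row group
    num_rows: int
    min_value: int
    max_value: int


def pages_matching_eq(pages, value):
    """Indices of the pages whose [min, max] range may contain `value`."""
    return [i for i, p in enumerate(pages) if p.min_value <= value <= p.max_value]


# Invented statistics for the three pages of column c3.
c3_pages = [
    PageStats(first_row=0, num_rows=6, min_value=0, max_value=9),
    PageStats(first_row=6, num_rows=6, min_value=10, max_value=19),
    PageStats(first_row=12, num_rows=14, min_value=20, max_value=40),
]
selected = pages_matching_eq(c3_pages, 15)
print(selected)
```

With these made-up statistics, a filter `c3 = 15` keeps only the second page, and the row range of that page is what the other columns then have to be matched against.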
I am wondering why page selection is better than row group selection, i.e.
why we should create multiple pages per row group instead of writing
more row groups with a single data page.
Specifically, suppose that from a sequence of data pages we identified that
only a subset is relevant, say in the example above a filter on column c3
selected `p32`. To apply the filter across columns, we need to select, from
all columns, the pages that overlap the interval of rows to which that page
belongs:
row group 1:
c1: |--p11--|--p12--|---p13---|
c2: |--p21--|-------p22-------|
c3: |-p31-|-p32-|-----p33-----|
           --I--
which, in this example, is:
c1: |--p11--|--p12--|
c2: |--p21--|-------p22-------|
c3: |-p31-|-p32-|-----p33-----|
           --I--
(i.e. we still have to load most of the data, because page boundaries are
not aligned across columns).
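Here is a small sketch of the selection I have in mind. The row counts per page are made up to mimic the widths in the diagram, and `page_ranges` and `overlapping_pages` are hypothetical helpers:

```python
def page_ranges(rows_per_page):
    """Turn per-page row counts into half-open [start, end) row ranges."""
    ranges, start = [], 0
    for n in rows_per_page:
        ranges.append((start, start + n))
        start += n
    return ranges


def overlapping_pages(rows_per_page, interval):
    """Indices of the pages that share at least one row with `interval`."""
    lo, hi = interval
    return [
        i for i, (s, e) in enumerate(page_ranges(rows_per_page))
        if s < hi and lo < e
    ]


# Rows per page for each column; every column covers the same 26 rows.
columns = {"c1": [8, 8, 10], "c2": [8, 18], "c3": [6, 6, 14]}
interval = page_ranges(columns["c3"])[1]  # rows covered by the second c3 page
to_load = {name: overlapping_pages(pages, interval) for name, pages in columns.items()}
rows_loaded = sum(sum(columns[name][i] for i in idx) for name, idx in to_load.items())
print(interval, to_load, rows_loaded)
```

Under these assumed sizes, the interval covers 6 rows, yet both selected pages of c1 and both pages of c2 have to be loaded, which is 48 of the 78 row values in the row group.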
My thinking here is that, instead of creating multiple data pages per row
group, we could have created multiple row groups with a single data page:
row group 1:
c1: |--p11--|
c2: |--p21--|
c3: |--p31--|
row group 2:
c1: |--p12--|
c2: |--p22--|
c3: |--p32--|
...
and the filter would be applied at the row group level, i.e. by only
selecting row group 2. In other words, the fact that row group boundaries
enforce row alignment seems quite important for filter pushdown, as it
minimizes the data loaded across columns.
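For contrast, here is the same kind of back-of-the-envelope sketch for the aligned layout, with one page per column in each row group. The row counts are again made up and `row_groups_to_load` is a hypothetical helper:

```python
def row_groups_to_load(rows_per_group, interval):
    """Indices of the row groups that share at least one row with `interval`."""
    lo, hi = interval
    selected, start = [], 0
    for i, n in enumerate(rows_per_group):
        if start < hi and lo < start + n:
            selected.append(i)
        start += n
    return selected


rows_per_group = [6, 6, 14]  # assumed sizes; boundaries are shared by all columns
num_columns = 3
groups = row_groups_to_load(rows_per_group, (6, 12))
# One page per column per row group, so every column loads exactly these rows.
rows_loaded = sum(rows_per_group[i] for i in groups) * num_columns
print(groups, rows_loaded)
```

With these assumed sizes, only the second row group is read, and each column loads exactly the 6 rows in the interval.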
Is the idea that pages serve a different purpose, and that column and
offset indexes are used to improve the filter pushdown potential? If so,
what is the purpose of having multiple pages in a single row group? I could
not find documentation or guidelines on this. I remember reading somewhere
that most writers default to a number of bytes per page that is tuned for
retrieval from HDFS(?), but e.g. in distributed blob storage this is not
very relevant afaik(?).
Any guidance would be very much appreciated.
Best,
Jorge