I had a similar concern to Uwe - if there are a large number of columns
with variable size there does seem to be a real risk of having many tiny
pages.

I wonder if we could do something in-between where we allow different page
sizes for different columns, but require that the row ranges for pages of
different columns are either the same or one contains the other. I.e. if
you have row ranges [a, b) and [c, d) from two different columns, then
either they don't overlap (c >= b || a >= d) or one contains the other
((c >= a && d <= b) || (a >= c && b <= d)).
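
To make the constraint concrete, here is a minimal Python sketch of the
check for two half-open row ranges. The function name ranges_compatible is
made up for illustration and is not taken from any Parquet implementation.

```python
def ranges_compatible(a, b, c, d):
    """Check the proposed constraint for half-open row ranges [a, b) and [c, d).

    Hypothetical helper: returns True if the two ranges are either
    disjoint or nested (one fully contains the other, which also
    covers the case where they are identical).
    """
    disjoint = c >= b or a >= d
    first_contains_second = c >= a and d <= b
    second_contains_first = a >= c and b <= d
    return disjoint or first_contains_second or second_contains_first
```

Two pages that only partially overlap, e.g. [0, 30000) and [20000, 40000),
are the case this would reject.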

E.g. if you have three columns, small, medium and large that fit 50000,
20000 and 1020 values per page, you could meet the above constraint with
the following set of row ranges where pages are truncated when a page with
an enclosing row range is full.

Small: [0, 50000), [50000, 100000)
Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000), [70000,
90000), [90000, 100000)
Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19380, 20000), ...
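
A rough sketch of how a writer might derive such a layout, assuming flat
columns with exactly one value per row and assuming the columns are
processed from the largest per-page capacity to the smallest. The function
page_ranges and its parameters are hypothetical names, not an existing API.

```python
def page_ranges(total_rows, values_per_page):
    """Compute nested page row ranges for several columns.

    Assumptions (for illustration only): one value per row, and
    values_per_page is ordered from the largest capacity to the smallest.
    Each column's pages are cut inside the pages of the previous column,
    so a page is truncated whenever its enclosing page is full.
    Returns one list of (start, end) half-open ranges per column.
    """
    result = []
    enclosing = [(0, total_rows)]  # the whole row group encloses everything
    for capacity in values_per_page:
        pages = []
        for start, end in enclosing:
            pos = start
            while pos < end:
                # Never let a page cross the end of its enclosing range.
                page_end = min(pos + capacity, end)
                pages.append((pos, page_end))
                pos = page_end
        result.append(pages)
        enclosing = pages
    return result


small, medium, large = page_ranges(100000, [50000, 20000, 1020])
```

With the capacities from the example, this yields 2 pages for the small
column and 6 for the medium one, with the large column's pages never
crossing a medium page boundary.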

That seems like it would simplify the calculation of the relevant pages on
the read path, although you would still need to have logic to skip values
within a page.
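
As an illustration of the skipping logic that would remain, here is a small
sketch that finds the page holding a given row and how many values to skip
inside it. It assumes one value per row and a sorted list of half-open page
ranges; locate and the sample page list are invented for this example.

```python
from bisect import bisect_right


def locate(pages, row):
    """Find the page containing `row` and the offset of the row within it.

    `pages` is a sorted list of (start, end) half-open row ranges for one
    column. Returns (page_index, values_to_skip), assuming one value per
    row (a simplification that does not hold for nested columns).
    """
    starts = [start for start, _ in pages]
    idx = bisect_right(starts, row) - 1
    if idx < 0 or row >= pages[idx][1]:
        raise ValueError("row is outside the column's row ranges")
    return idx, row - pages[idx][0]


# Hypothetical page layout for a medium-sized column.
medium_pages = [(0, 20000), (20000, 40000), (40000, 50000)]
page_index, values_to_skip = locate(medium_pages, 45000)
```

With nested row ranges, the page index found for one column also bounds the
search in any column whose pages are enclosed by that page.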



On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <[email protected]> wrote:

> Hello Gabor,
>
> comment in-line
>
> > The implementation was done based on the original design of column
> > indexes
> > <https://github.com/apache/parquet-format/blob/master/PageIndex.md>
> > meaning that no row alignment is required between the pages (the only
> > requirement is for the pages to respect row boundaries).
> > As we described in the preview parquet sync the design/implementation
> > would be much more clear (and might perform a bit better) if the row
> > alignment would also be required. I would be happy to modify the
> > implementation if we would decide to align pages on rows. *I would like
> > to have a final decision on this topic before merging this feature.*
>
> I'm not 100% certain what "row alignment" could mean; I'm thinking of two
> very different things.
>
> 1.  It would mean that all columns in a RowGroup would have the same
> number of pages that would all align on the same set of rows.
> 2. It would mean that pages are only split on the highest nesting level,
> i.e. only split on what would be the horizontal boundaries on a 2D-table.
> I.e. not splitting any cells of this table structure.
>
> If the interpretation is 1, then I think this is generating far too many
> pages for very sparse columns. But I'm guessing that the interpretation is
> rather 2, and there I would be more interested in the concerns that were
> raised in the sync. This type of alignment is also something that gave me
> some headaches when implementing things in parquet-cpp. From a Parquet
> developer's perspective, this would really ease the implementation but I'm
> wondering if there are use-cases where a single cell of a table becomes
> larger than what we would normally put into a page.
>
> Uwe
>
