Re: Status of column index in parquet-mr

Gabor Szadovszky Tue, 21 Aug 2018 00:52:08 -0700

Hi,

Row alignment in my wording was the 1st definition in Uwe's mail. From the
column-index-based filtering point of view, the implementation and the logic
would be much simpler in this case, but I do understand that the page
sizes would not be optimal. It seems the community is against row
alignment, so I would close this topic.
The 2nd definition is required for column indexes and already mentioned in
the spec. For page v2 we already do the same anyway.
Tim, as you mentioned, it would still require skipping values. The biggest
pain in my implementation was to pass the required values through the API
to implement the skipping. The calculation itself was not that complicated.
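
To illustrate the kind of calculation meant here, below is a minimal Python sketch, not the parquet-mr code. It assumes a flat (non-nested) column where one value corresponds to one row; the function name `skip_read_plan` and its arguments are hypothetical. Given the row range covered by a page and the row ranges selected by the filter, it works out how many values to skip and how many to read, in order.

```python
def skip_read_plan(page, selected):
    """Plan skip/read runs for one page of a flat column.

    page:     half-open row range (first_row, end_row) covered by the page
    selected: sorted, non-overlapping half-open row ranges chosen by the filter
    Returns a list of ("skip" | "read", value_count) tuples.
    """
    first, end = page
    plan = []
    pos = first  # next row of the page that has not been planned yet
    for start, stop in selected:
        # Clip the selected range to the rows this page actually holds.
        start, stop = max(start, first), min(stop, end)
        if start >= stop:
            continue  # no overlap with this page
        if start > pos:
            plan.append(("skip", start - pos))
        plan.append(("read", stop - start))
        pos = stop
    if pos < end:
        plan.append(("skip", end - pos))  # unselected tail of the page
    return plan


if __name__ == "__main__":
    print(skip_read_plan((1020, 2040), [(0, 1100), (1500, 1600), (2000, 3000)]))
```

For nested columns the counts would have to be derived from repetition levels rather than taken directly from row numbers, which this sketch does not attempt.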


Thanks a lot,
Gabor

On Mon, Aug 20, 2018 at 7:51 PM Tim Armstrong
<[email protected]> wrote:

> I had a similar concern to Uwe - if there are a large number of columns
> with variable size there does seem to be a real risk of having many tiny
> pages.
>
> I wonder if we could do something in-between where we allow different page
> sizes for different columns, but require that the row ranges for pages of
> different columns either are the same or one contains the other. I.e. if
> you have row ranges [a, b) and [c, d) from two different columns, then
> either they don't overlap (c >= b || a >= d) or one contains the other (c
> >= a && d <= b) || (a >= c && b <= d)
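
The condition above can be written down directly as a predicate. This is a minimal Python sketch; the function names `compatible` and `columns_compatible` are made up for illustration, and row ranges are represented as half-open `(start, end)` tuples.

```python
def compatible(r1, r2):
    """True if half-open row ranges r1 = [a, b) and r2 = [c, d)
    either do not overlap or one contains the other."""
    a, b = r1
    c, d = r2
    disjoint = c >= b or a >= d
    nested = (c >= a and d <= b) or (a >= c and b <= d)
    return disjoint or nested


def columns_compatible(pages_x, pages_y):
    """Check the constraint for every pair of pages from two columns."""
    return all(compatible(p, q) for p in pages_x for q in pages_y)


if __name__ == "__main__":
    print(compatible((0, 50000), (20000, 40000)))  # nested
    print(compatible((0, 20000), (19380, 20400)))  # partial overlap
```

The pairwise check is quadratic in the number of pages; it is meant only to make the constraint concrete, not as an efficient validator.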
>
> E.g. if you have three columns, small, medium and large that fit 50000,
> 20000 and 1020 values per page, you could meet the above constraint with
> the following set of row ranges where pages are truncated when a page with
> an enclosing row range is full.
>
> Small: [0, 50000), [50000, 100000)
> Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000), [70000,
> 90000), [90000, 100000)
> Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19380, 20000), ...
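
One way to produce row ranges like these is sketched below in Python. It is only an illustration of the truncation rule, under the assumptions that the columns are flat (one value per row), that each column has a fixed number of values per page, and that columns are processed from the most values per page to the fewest. The function name `split_pages` is hypothetical.

```python
def split_pages(total_rows, values_per_page):
    """Build page row ranges for several columns so that every page is
    truncated at the boundaries of the enclosing (coarser) column's pages.

    values_per_page: page capacities sorted in descending order.
    Returns one list of half-open (start, end) ranges per column.
    """
    layouts = []
    enclosing = [(0, total_rows)]  # the whole row group encloses everything
    for capacity in values_per_page:
        pages = []
        for start, end in enclosing:
            pos = start
            while pos < end:
                # Cut the page short when the enclosing page is full.
                nxt = min(pos + capacity, end)
                pages.append((pos, nxt))
                pos = nxt
        layouts.append(pages)
        enclosing = pages  # the next column nests inside these pages
    return layouts


if __name__ == "__main__":
    small, medium, large = split_pages(100000, [50000, 20000, 1020])
    print(small)
    print(medium)
    print(large[:3], large[19])
```

Because each column is cut at the page boundaries of the previous one, any two pages from different columns are either disjoint or nested, at the cost of some shorter pages at the end of each enclosing range.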
>
> That seems like it would simplify the calculation of the relevant pages on
> the read path, although you would still need to have logic to skip values
> within a page.
>
>
>
> On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <[email protected]> wrote:
>
> > Hello Gabor,
> >
> > comment in-line
> >
> > > The implementation was done based on the original design of column
> > > indexes
> > > <https://github.com/apache/parquet-format/blob/master/PageIndex.md>,
> > > meaning that no row alignment is required between the pages (the only
> > > requirement is for the pages to respect row boundaries).
> > > As we described in the previous Parquet sync, the design/implementation
> > > would be much clearer (and might perform a bit better) if row alignment
> > > were also required. I would be happy to modify the implementation if we
> > > decide to align pages on rows. *I would like to have a final decision
> > > on this topic before merging this feature.*
> >
> > I'm not 100% certain what "row alignment" could mean; I am thinking of two
> > very different things.
> >
> > 1.  It would mean that all columns in a RowGroup would have the same
> > number of pages that would all align on the same set of rows.
> > 2. It would mean that pages are only split on the highest nesting level,
> > i.e. only split on what would be the horizontal boundaries on a 2D-table.
> > I.e. not splitting any cells of this table structure.
> >
> > If the interpretation is 1, then I think this would generate far too many
> > pages for very sparse columns. But I'm guessing that the interpretation is
> > rather 2, and there I would be more interested in the concerns that were
> > raised in the sync. This type of alignment is also something that gave me
> > some headaches when implementing things in parquet-cpp. From a Parquet
> > developer's perspective, this would really ease the implementation, but
> > I'm wondering if there are use cases where a single cell of a table becomes
> > larger than what we would normally put into a page.
> >
> > Uwe
> >
>
