Re: Status of column index in parquet-mr

Gabor Szadovszky Wed, 22 Aug 2018 06:04:39 -0700

Hi,

I would really appreciate it if someone would review PARQUET-1389
<https://issues.apache.org/jira/browse/PARQUET-1389> and PARQUET-1386
<https://issues.apache.org/jira/browse/PARQUET-1386>. After these two
outstanding modifications I would be able to merge the whole feature branch
column-indexes <https://github.com/apache/parquet-mr/tree/column-indexes>.
Also, feel free to comment on any modification on the branch itself.


Any opinions on the idea of writing column indexes only when they would
result in better filtering?

Thanks a lot,
Gabor

On Tue, Aug 21, 2018 at 9:51 AM Gabor Szadovszky <
[email protected]> wrote:

> Hi,
>
> Row alignment in my wording was the 1st definition in Uwe's mail. From the
> point of view of column-index-based filtering, the implementation and the
> logic would be much simpler in this case, but I do understand that the page
> sizes would not be optimal. It seems the community is against row
> alignment, so I would close this topic.
> The 2nd definition is required for column indexes and already mentioned in
> the spec. For page v2 we already do the same anyway.
> Tim, as you mentioned, it would still require skipping values. The biggest
> pain in my implementation was to pass the required values through the API
> to implement the skipping. The calculation itself was not that complicated.
>
> Thanks a lot,
> Gabor
>
> On Mon, Aug 20, 2018 at 7:51 PM Tim Armstrong
> <[email protected]> wrote:
>
>> I had a similar concern to Uwe - if there are a large number of columns
>> with variable size there does seem to be a real risk of having many tiny
>> pages.
>>
>> I wonder if we could do something in-between where we allow different page
>> sizes for different columns, but require that the row ranges for pages of
>> different columns either are the same or one contains the other. I.e. if
>> you have row ranges [a, b) and [c, d) from two different columns, then
>> either they don't overlap (c >= b || a >= d) or one contains the other
>> ((c >= a && d <= b) || (a >= c && b <= d)).
>>
>> E.g. if you have three columns, small, medium and large that fit 50000,
>> 20000 and 1020 values per page, you could meet the above constraint with
>> the following set of row ranges where pages are truncated when a page with
>> an enclosing row range is full.
>>
>> Small: [0, 50000), [50000, 100000)
>> Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000),
>> [70000, 90000), [90000, 100000)
>> Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19380, 20000), ...
>>
>> That seems like it would simplify the calculation of the relevant pages on
>> the read path, although you would still need to have logic to skip values
>> within a page.
>>
>>
>>
>> On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <[email protected]> wrote:
>>
>> > Hello Gabor,
>> >
>> > comment in-line
>> >
>> > > The implementation was done based on the original design of column
>> > > indexes
>> > > <https://github.com/apache/parquet-format/blob/master/PageIndex.md>,
>> > > meaning that no row alignment is required between the pages (the only
>> > > requirement is for the pages to respect row boundaries).
>> > > As we described in the previous parquet sync, the design/implementation
>> > > would be much clearer (and might perform a bit better) if row alignment
>> > > were also required. I would be happy to modify the implementation if we
>> > > decide to align pages on rows. *I would like to have a final decision
>> > > on this topic before merging this feature.*
>> >
>> > I'm not 100% certain what "row alignment" could mean; I am thinking of
>> > two very different things.
>> >
>> > 1. It would mean that all columns in a RowGroup would have the same
>> > number of pages that would all align on the same set of rows.
>> > 2. It would mean that pages are only split on the highest nesting level,
>> > i.e. only split on what would be the horizontal boundaries on a 2D table,
>> > not splitting any cells of this table structure.
>> >
>> > If the interpretation is 1, then I think this would generate far too
>> > many pages for very sparse columns. But I'm guessing that the
>> > interpretation is rather 2, and there I would be more interested in the
>> > concerns that were raised in the sync. This type of alignment is also
>> > something that gave me some headaches when implementing things in
>> > parquet-cpp. From a Parquet developer's perspective, this would really
>> > ease the implementation, but I'm wondering if there are use cases where
>> > a single cell of a table becomes larger than what we would normally put
>> > into a page.
>> >
>> > Uwe
>> >
>>
>
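
For reference, the "nested or disjoint" constraint on row ranges that Tim
proposes in the quoted message can be sketched roughly as follows. This is
only an illustration and not parquet-mr code: the names nested_or_disjoint
and split_pages are hypothetical, the total of 100000 rows is taken from the
last range in his example, and one value per row (flat columns) is assumed.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

RowRange = Tuple[int, int]  # half-open row range [start, end)


def nested_or_disjoint(r1: RowRange, r2: RowRange) -> bool:
    """True if the ranges do not overlap or one contains the other."""
    a, b = r1
    c, d = r2
    disjoint = c >= b or a >= d
    r1_contains_r2 = c >= a and d <= b
    r2_contains_r1 = a >= c and b <= d
    return disjoint or r1_contains_r2 or r2_contains_r1


def split_pages(total_rows: int, rows_per_page: int,
                enclosing: Optional[List[RowRange]] = None) -> List[RowRange]:
    """Split [0, total_rows) into pages of at most rows_per_page rows.

    A page is truncated early when it would otherwise cross the end of a
    page from the enclosing (coarser) column.
    """
    boundaries = sorted({end for _, end in (enclosing or [])})
    pages: List[RowRange] = []
    start = 0
    while start < total_rows:
        end = min(start + rows_per_page, total_rows)
        # First enclosing boundary strictly after `start`, if any.
        i = bisect_right(boundaries, start)
        if i < len(boundaries) and boundaries[i] < end:
            end = boundaries[i]
        pages.append((start, end))
        start = end
    return pages


# The three columns from the example: 50000, 20000 and 1020 rows per page.
small = split_pages(100_000, 50_000)
medium = split_pages(100_000, 20_000, enclosing=small)
large = split_pages(100_000, 1_020, enclosing=medium)

# Every pair of pages from different columns satisfies the constraint.
columns = [small, medium, large]
ok = all(nested_or_disjoint(p, q)
         for i, col in enumerate(columns)
         for other in columns[i + 1:]
         for p in col for q in other)
```

Truncating each column only at the boundaries of the next coarser column is
enough here because those boundaries already include the boundaries of every
column above it. Skipping values within a page on the read path is a separate
problem that this sketch does not cover.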
