Re: Reading Parquet in Iceberg

Ryan Blue Wed, 06 Mar 2019 08:32:28 -0800

Anton, you're mostly right. Page skipping in Parquet requires that pages
are aligned on record boundaries. That's a requirement we set in the spec
for this feature. So when column indexes are present, you know that records
are not split across pages. The offset index then stores the first row index
of each page, which tells you how many records each page holds, so you know
which pages to skip in other columns.
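
To make the mapping between columns concrete, here is a minimal Python
sketch. It is not Parquet or Iceberg code: the function names and the sample
numbers are made up for illustration, and it assumes you already have the
first row index of every page for each column chunk (which is what the offset
index provides) plus the row count of the row group.

```python
def page_row_ranges(first_row_indexes, row_group_rows):
    """Turn per-page first row indexes into [start, end) row ranges."""
    ends = list(first_row_indexes[1:]) + [row_group_rows]
    return list(zip(first_row_indexes, ends))


def pages_to_read(first_row_indexes, row_group_rows, kept_row_ranges):
    """Return indexes of pages that overlap at least one kept row range."""
    needed = []
    ranges = page_row_ranges(first_row_indexes, row_group_rows)
    for page, (start, end) in enumerate(ranges):
        # Two half-open ranges overlap when each starts before the other ends.
        if any(start < k_end and k_start < end
               for k_start, k_end in kept_row_ranges):
            needed.append(page)
    return needed


# Hypothetical row group with 100 rows; page boundaries differ per column.
ROWS = 100
col_a_first_rows = [0, 25, 50, 75]        # 4 pages in column A
col_b_first_rows = [0, 10, 50, 60, 90]    # 5 pages in column B

# Assume the predicate on column A rules out its page 2 (rows 50..74).
a_ranges = page_row_ranges(col_a_first_rows, ROWS)
kept = [r for i, r in enumerate(a_ranges) if i != 2]

# Column B can skip only the pages that fall entirely inside rows 50..74.
print(pages_to_read(col_b_first_rows, ROWS, kept))
```

In this example column B's page 2 (rows 50..59) can be skipped outright.
Page 3 (rows 60..89) still has to be read because it also holds rows 75..89,
so a reader would discard rows 60..74 after decoding that page.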


On Wed, Mar 6, 2019 at 4:40 AM Anton Okolnychyi
<[email protected]> wrote:

> Hm, is it fair to say that making dictionary encoding work for predicate
> columns is a way to mitigate the absence of page skipping?
>
> > On 6 Mar 2019, at 12:19, Anton Okolnychyi <[email protected]> wrote:
> >
> > Hi,
> >
> > I was going through the code in Iceberg ParquetReader. Could anybody
> > confirm or correct my statements below?
> >
> > Right now, Iceberg can filter out row groups in Parquet. Iceberg fetches
> > row group stats from the footer and applies ParquetMetricsRowGroupFilter
> > on that information. In addition, the footer contains metadata per
> > column chunk including its offset. ParquetDictionaryRowGroupFilter uses
> > that column chunk metadata to read an optional dictionary page for each
> > column chunk. If a dictionary page is present, it will always be at the
> > beginning of each column chunk. ParquetDictionaryRowGroupFilter ensures
> > that all pages within a column chunk are dictionary encoded when Iceberg
> > filters out row groups based on dictionaries.
> >
> > Also, I have a question about skipping individual pages using page
> > stats. To the best of my knowledge, this info was originally stored in
> > page headers, which made page skipping not as efficient as it could be
> > because it required reading all page headers spread out throughout the
> > file. I remember some efforts in the Parquet community to add page level
> > statistics to the footer.
> >
> > Now let's assume we have page level stats in the footer or have an
> > efficient way to collect that info. Then we have a query that covers two
> > columns. Using a predicate on the first column, we see that page 3
> > doesn't contain any relevant values, so we can skip the entire page for
> > that column. However, we cannot just skip page 3 for the second column
> > as the number of values within a page is not fixed and might vary
> > between column chunks. Basically, there is no one-to-one mapping between
> > pages.
> >
> > My question is whether we can have relatively efficient page skipping
> > in Parquet at this point.
> >
> > Thanks,
> > Anton
>
>

-- 
Ryan Blue
Software Engineer
Netflix
