On the point of this already being possible today: if a filter covers only
a single column, you would still need to read the page headers in all of
the columns to confirm that you could skip the pages, because the row
counts merely happen to match. I think it makes sense to declare this in
the row group level metadata.
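
To make that concrete, here is a minimal sketch (Java, with hypothetical
types rather than the real parquet-mr API) of the cross-column header scan
a reader has to do today; with a row group level flag, this whole check
collapses to reading one boolean:

    import java.util.List;

    class AlignmentProbe {
      // Hypothetical page header type; only DataPageHeaderV2 carries a
      // per-page record count.
      interface PageHeader { long recordCount(); }

      static boolean pagesAreRecordAligned(List<List<PageHeader>> columns) {
        List<PageHeader> first = columns.get(0);
        for (List<PageHeader> col : columns) {
          if (col.size() != first.size()) return false;
          for (int i = 0; i < first.size(); i++) {
            // Boundaries line up only if every page holds the same
            // number of records in every column.
            if (col.get(i).recordCount() != first.get(i).recordCount())
              return false;
          }
        }
        return true;
      }
    }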

On Thu, Nov 19, 2015 at 12:14 PM, Nong Li <[email protected]> wrote:

> Inline.
>
> On Thu, Nov 19, 2015 at 11:38 AM, Julien Le Dem <[email protected]> wrote:
>
> > Thanks Nong for looking into this!
> > Several comments:
> > - I would think that this is already possible today without changing the
> > format. With page format v2 we have the row count per page, so you can
> > tell by looking at the row counts whether the pages are aligned or not.
> > That said, adding an optional field is backward compatible and would be
> > fine.
> >
>
> That's true. The writer can just do this and the reader can figure it
> out. I think that is problematic because it is fairly complex to
> implement. Furthermore, I prefer having this in the metadata to better
> support filter push down. I'd argue parquet-mr and other libraries should
> *not* implement row-by-row predicate evaluation; the engine using the
> library can probably do it more efficiently. We should at least have a
> way to restrict filter pushdown to the cases where it is efficient (i.e.
> skipping row groups and pages). Having it in the metadata allows better
> negotiation on which filters are pushable.
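>
> A minimal sketch of that negotiation, with hypothetical metadata
> accessors (the real names would come out of the format change):
>
>     class PushdownPolicy {
>       interface RowGroupMetadata {
>         boolean pagesRecordAligned();            // the proposed flag
>         boolean hasRowGroupStats(String column);
>         boolean hasPageStats(String column);
>       }
>
>       // Push a filter down only when the library can honor it by
>       // skipping row groups or pages; otherwise the engine evaluates
>       // it row by row itself.
>       static boolean shouldPushDown(RowGroupMetadata rg, String column) {
>         return rg.hasRowGroupStats(column)
>             || (rg.pagesRecordAligned() && rg.hasPageStats(column));
>       }
>     }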
>
>
> > - The writer can be modified to explicitly align pages. Would we want
> > this to be optional?
> >
>
> I'm not sure what that means. If the writer knows and chooses to do this,
> it would set the metadata.
>
>
> > - I don't think we have implemented that yet, but the decoders could
> > have an efficient skip(nValues) method (RLE or dictionary encoding, for
> > example, would skip very efficiently). Of course, if there are repeated
> > fields you need to decode the repetition levels first to know how many
> > values to skip. That would be less efficient than directly skipping
> > more granular pages. But we may want to have both to mitigate the
> > tradeoffs between compression and skipping.
> >
> Yeah, I think the writer can choose depending on how storage conscious
> they are. Efficient skip for some encodings is good, but it still
> requires quite a bit more work than skipping entire pages.
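>
> To illustrate the extra work, a sketch of record skipping for a repeated
> column (hypothetical decoder interfaces, not actual parquet-mr code):
> even when skipping the values themselves is cheap, the repetition levels
> still have to be decoded one at a time.
>
>     class RecordSkipper {
>       interface LevelDecoder { boolean hasNext(); int peek(); int next(); }
>       interface ValueDecoder { void skip(long n); }
>
>       // A repetition level of 0 marks the first value of a record, so
>       // skipping n records means decoding levels until we reach the
>       // start of record n+1. Assumes we start on a record boundary.
>       static void skipRecords(LevelDecoder repLevels, ValueDecoder values,
>                               long n) {
>         long valuesToSkip = 0;
>         long recordStarts = 0;
>         while (repLevels.hasNext()) {
>           if (repLevels.peek() == 0 && ++recordStarts > n)
>             break;  // first value of the record we keep
>           repLevels.next();
>           valuesToSkip++;
>         }
>         values.skip(valuesToSkip);  // cheap for RLE / dictionary pages
>       }
>     }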
>
>
> > Julien
> >
> >
> > On Thu, Nov 19, 2015 at 11:07 AM, Todd Lipcon <[email protected]> wrote:
> >
> > > I might be missing something, but what's the point of ensuring
> > > cross-column alignment, so long as you have the record _count_ per
> > > page, and ensure that a single record doesn't split across pages?
> > > i.e. you need to be "record-aligned" within a column, but not sure
> > > why you have to guarantee that page "N" in column 1 has the same
> > > records as page "N" in column 2, since you can still use the ordinal
> > > record indexes to skip over irrelevant pages.
> >
> There is no ordinal index, and even if there were, I'm not sure how
> efficient it would be for this case. The use case here is not single-row
> lookups but being able to take advantage of skipping using the column
> stats.
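>
> For example, with aligned pages and page-level stats, the pruning could
> look like this (a minimal sketch with hypothetical types, shown for a
> numeric range predicate):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     class PagePruner {
>       interface PageStats { long min(); long max(); }
>
>       // Keep only the pages whose min/max overlap the range [lo, hi].
>       static List<Integer> selectPages(List<PageStats> pages,
>                                        long lo, long hi) {
>         List<Integer> keep = new ArrayList<>();
>         for (int i = 0; i < pages.size(); i++) {
>           PageStats s = pages.get(i);
>           if (s.max() >= lo && s.min() <= hi) keep.add(i);  // may match
>         }
>         // Because pages are record aligned, the same page ordinals can
>         // be read from every other column in the row group.
>         return keep;
>       }
>     }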
>
> >
> > > -Todd
> > >
> > > On Thu, Nov 19, 2015 at 10:57 AM, Nong Li <[email protected]> wrote:
> > >
> > > > I'd like to propose a change to the format spec to add metadata
> > > > indicating that the pages in a row group are record aligned. In
> > > > other words, page N consists of the same records across all
> > > > columns. The benefit of this would be to allow skipping at the
> > > > page level.
> > > >
> > > > The change would add a single optional boolean at the row group
> > > > metadata level, and it would only be supported with
> > > > DataPageHeaderV2 (V1 doesn't have a counter for the number of
> > > > records in a page, only the number of values). This is compatible
> > > > with existing readers, which can simply ignore it.
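> > > >
> > > > As a sketch, the reader-side check could be as simple as this
> > > > (hypothetical field and accessor names; the real names would be
> > > > settled when the format change is made):
> > > >
> > > >     class PageSkipCheck {
> > > >       interface RowGroupMetadata {
> > > >         // the proposed optional boolean, e.g.
> > > >         // "pages_record_aligned"; old readers never read it
> > > >         boolean isSetPagesRecordAligned();
> > > >         boolean getPagesRecordAligned();
> > > >       }
> > > >
> > > >       static boolean canSkipAtPageLevel(RowGroupMetadata rg) {
> > > >         // Requires DataPageHeaderV2: V1 headers count values,
> > > >         // not records, so alignment can't be used with V1 pages.
> > > >         return rg.isSetPagesRecordAligned()
> > > >             && rg.getPagesRecordAligned();
> > > >       }
> > > >     }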
> > > >
> > > > Background:
> > > > We originally chose roughly fixed-byte-size pages to maximize
> > > > storage density. Pages are defined as the unit of indivisible work
> > > > (e.g. compression across the entire page, or encodings that need
> > > > to be bulk decoded). A smaller page size improves single-row
> > > > latency (i.e. with a 1MB page, reading a single value requires
> > > > decoding 1MB). A larger page size generally improves the
> > > > efficiency of general purpose compression algorithms. Since the
> > > > number of bytes per record varies quite a lot between columns, it
> > > > is not possible to have record-aligned pages that are all roughly
> > > > the same size.
> > > >
> > > > This change would mean the more compressible columns are now
> > > > smaller (in terms of bytes in a page) and might not compress as
> > > > well using general purpose compression. As these are already small
> > > > and compressed, the value of general purpose compression is low
> > > > and the cost to overall storage footprint is small.
> > > >
> > > > The benefit would be to allow skipping pages using the page-level
> > > > statistics, which can speed up filtering quite a lot.
> > > >
> > > > A ballpark for the number of values is 8K, which results in
> > > > roughly the same page size for values that are 8 bytes per row
> > > > (8K values x 8 bytes is about 64KB per page).
> > > >
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> >
> > --
> > Julien
> >
>
