On the point of this already being possible today: even if a filter covers only a single column, you would need to read the page headers in all of the columns to confirm that the row counts match and that you can therefore skip the pages together. I think it makes sense to declare this in the row-group-level metadata.
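[Editorial note: to make the point above concrete, a reader could establish alignment today by walking the DataPageHeaderV2 headers of every column chunk in the row group and checking that the per-page row counts match position by position. A minimal sketch in Java, assuming a hypothetical PageHeaderV2 view with a numRows accessor; this is not the parquet-mr API:

    import java.util.List;

    // Hypothetical view of a V2 data page header; only the per-page
    // record count is assumed.
    record PageHeaderV2(int numRows) {}

    final class AlignmentCheck {
        // True if every column chunk has the same number of pages and the
        // same record count in each page, i.e. page i covers the same
        // records in every column and can be skipped as a unit.
        static boolean pagesAreAligned(List<List<PageHeaderV2>> columns) {
            if (columns.isEmpty()) return true;
            List<PageHeaderV2> first = columns.get(0);
            for (List<PageHeaderV2> col : columns.subList(1, columns.size())) {
                if (col.size() != first.size()) return false;
                for (int i = 0; i < col.size(); i++) {
                    if (col.get(i).numRows() != first.get(i).numRows()) return false;
                }
            }
            return true;
        }
    }

The cost alluded to above is visible here: every page header in the row group has to be read before any page can be skipped, which is what a single row-group-level flag would avoid.]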
On Thu, Nov 19, 2015 at 12:14 PM, Nong Li <[email protected]> wrote:

> Inline.
>
> On Thu, Nov 19, 2015 at 11:38 AM, Julien Le Dem <[email protected]> wrote:
>
> > Thanks Nong for looking into this!
> > Several comments:
> > - I would think that this is already possible today without changing the
> > format. With Page format v2 we have the row count per page, so you know
> > by looking at the row counts whether they are aligned or not. That said,
> > adding an optional field is backward compatible and would be fine.
>
> That's true. The writer can just do this and the reader can figure it out.
> I think this is problematic because it is fairly complex to implement.
> Furthermore, I prefer having this in the metadata to better support filter
> pushdown. I'd argue parquet-mr and other libraries should *not* implement
> row-by-row predicate evaluation. The engine using the library can probably
> do it more efficiently. We should at least have a way to restrict filters
> to the cases where they are efficient (i.e. skipping row groups and
> pages). Having it in the metadata allows better negotiation on which
> filters are pushable.
>
> > - The writer can be modified to explicitly align pages. Would we want
> > this to be optional?
>
> I'm not sure what that means. If the writer knows and chooses to do this,
> it would set the metadata.
>
> > - I don't think we have implemented that yet, but the decoders could
> > have an efficient skip(nValues) method (RLE or dictionary encoding, for
> > example, would skip very efficiently). Of course, if there are repeated
> > fields you need to decode the repetition levels first to know how many
> > values to skip. That would be less efficient than directly skipping
> > whole pages, but we may want to have both to mitigate the tradeoff
> > between compression and skipping.
>
> Yes, I think the writer can choose depending on how storage-conscious they
> are. Efficient skip for some encodings is good, but it still requires
> quite a bit more work than skipping entire pages.
>
> > Julien
> >
> > On Thu, Nov 19, 2015 at 11:07 AM, Todd Lipcon <[email protected]> wrote:
> >
> > > I might be missing something, but what's the point of ensuring
> > > cross-column alignment, so long as you have the record _count_ per
> > > page and ensure that a single record doesn't split across pages? i.e.
> > > you need to be "record-aligned" within a column, but I'm not sure why
> > > you have to guarantee that page "N" in column 1 has the same records
> > > as page "N" in column 2, since you can still use the ordinal record
> > > indexes to skip over irrelevant pages.
>
> There is no ordinal index, and even if there were, I'm not sure how
> efficient it would be for this case. The use case here is not single-row
> lookups but being able to take advantage of skipping using the column
> stats.
>
> > > -Todd
> > >
> > > On Thu, Nov 19, 2015 at 10:57 AM, Nong Li <[email protected]> wrote:
> > >
> > > > I'd like to propose a change to the format spec to add metadata
> > > > indicating that the pages in a row group are record aligned; in
> > > > other words, page N consists of the same records across all columns.
> > > > The benefit of this would be to allow skipping at the page level.
> > > >
> > > > The change would add a single optional boolean at the row group
> > > > metadata level, and it would only be supported with DataPageHeaderV2
> > > > (V1 doesn't have a counter for the number of records in a page, only
> > > > the number of values). This is compatible with existing readers,
> > > > which can simply ignore it.
> > > >
> > > > Background:
> > > > We originally chose to have pages of roughly fixed byte size to
> > > > maximize storage density. Pages are defined as the unit of
> > > > indivisible work (e.g. compression across the entire page, or
> > > > encodings that need to be bulk decoded). A smaller page size
> > > > improves single-row latency (i.e. a 1MB page means reading a single
> > > > value requires decoding 1MB). A larger page size generally improves
> > > > the efficiency of general-purpose compression algorithms. Since the
> > > > number of bytes per record varies quite a lot between columns, it is
> > > > not possible to have record-aligned pages that are all roughly the
> > > > same size.
> > > >
> > > > This change would mean the more compressible columns now have
> > > > smaller pages (in terms of bytes) and might not compress as well
> > > > using general-purpose compression. As these pages are already small
> > > > and densely encoded, the value of general-purpose compression is low
> > > > and the cost to the overall storage footprint is small.
> > > >
> > > > The benefit would be to allow skipping pages using the page-level
> > > > statistics, which can speed up filtering quite a lot.
> > > >
> > > > A ballpark for the number of values is 8K, which results in roughly
> > > > the same page size for columns whose values are 8 bytes per row.
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> >
> > --
> > Julien
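[Editorial note: to illustrate the payoff the proposal describes, once a row group is flagged as record aligned, evaluating a range predicate against one column's page-level statistics yields a set of page indexes that every column can skip together. A sketch under assumed types; PageStats and pagesToRead are illustrative names, not part of the format or of parquet-mr:

    import java.util.BitSet;
    import java.util.List;

    // Hypothetical per-page view: the V2 record count plus integer
    // min/max statistics for one column.
    record PageStats(int numRows, long min, long max) {}

    final class PageSkipper {
        // Evaluate a range predicate lo <= v <= hi against one column's
        // page statistics; the surviving page indexes apply to every
        // column when the row-group flag guarantees record alignment.
        static BitSet pagesToRead(List<PageStats> filterColumn, long lo, long hi) {
            BitSet keep = new BitSet(filterColumn.size());
            for (int i = 0; i < filterColumn.size(); i++) {
                PageStats p = filterColumn.get(i);
                if (p.max() >= lo && p.min() <= hi) {
                    keep.set(i); // page may contain a match; others skip
                }
            }
            return keep;
        }
    }

With aligned pages the surviving BitSet applies verbatim to the other columns; without alignment, page boundaries differ per column and the matching row ranges would have to be intersected instead.]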

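[Editorial note: on the decoder-level skip(nValues) idea raised in the thread, run-length encoding illustrates why value skipping can be cheap: whole runs are dropped without materializing individual values. A toy sketch over a simplified (runLength, value) layout, not Parquet's actual hybrid RLE/bit-packed codec:

    // Toy run-length decoder; skip(n) touches at most one array slot per
    // run, while next() decodes values one at a time.
    final class RleDecoder {
        private final int[] runLengths;
        private final int[] values;
        private int run = 0;       // index of the current run
        private int usedInRun = 0; // values already consumed from that run

        RleDecoder(int[] runLengths, int[] values) {
            this.runLengths = runLengths;
            this.values = values;
        }

        // Advance past n values without decoding them.
        void skip(int n) {
            while (n > 0 && run < runLengths.length) {
                int left = runLengths[run] - usedInRun;
                if (n < left) { usedInRun += n; return; }
                n -= left;
                run++;
                usedInRun = 0;
            }
        }

        // Decode the next value (the slow path that skip() avoids).
        int next() {
            int v = values[run];
            if (++usedInRun == runLengths[run]) { run++; usedInRun = 0; }
            return v;
        }
    }

Even so, as the thread notes, this is more work per skipped value than discarding an entire page, and repeated fields would still require decoding repetition levels first.]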