On Thu, Nov 19, 2015 at 12:14 PM, Nong Li <[email protected]> wrote:

>
> >
> There is no ordinal index and even if there was I'm not sure how efficient
> it would be for
> this case. The use case here is not single row lookups but to be able to
> take advantage of
> skipping using the column stats.
>
>
Sorry, by "index" here I don't mean a B-tree index structure. Rather, I
mean the ordinal offset within the Parquet file (i.e. "record #12345").

If I follow you correctly, you're advocating cross-column page alignment so
that you get an equal number of pages even if one of the columns is highly
compressible. But, that is a writer side decision, and assuming you've
implemented "Skip(int numRecordsToSkip)" on the reader, it seems like the
reader doesn't need to know about whether pages are aligned. Sure, the
skipping might not be as efficient if trying to skip into the middle of a
large page, but, especially in the case of highly compressible data (RLE or
bitpacking) skipping into the middle of a page is pretty easy and efficient.
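
To illustrate why skipping into the middle of an RLE page is cheap, here is a
rough sketch in Python. It is not Parquet's actual reader API: the class
RleColumnReader, the (run_length, value) list representation, and the method
names are all made up for illustration, and it assumes every run has length
of at least 1. The point is only that skip() costs time proportional to the
number of runs crossed, not the number of records skipped.

```python
from typing import List, Tuple


class RleColumnReader:
    """Toy reader over run-length-encoded data (hypothetical, not Parquet's API).

    `runs` is a list of (run_length, value) pairs, each run_length >= 1.
    """

    def __init__(self, runs: List[Tuple[int, int]]) -> None:
        self.runs = runs
        self.run_idx = 0  # index of the run we are currently inside
        self.offset = 0   # records already consumed within that run

    def skip(self, num_records_to_skip: int) -> int:
        """Skip up to num_records_to_skip records; return how many were skipped.

        Whole runs are stepped over in one move, so no values are decoded.
        """
        remaining = num_records_to_skip
        while remaining > 0 and self.run_idx < len(self.runs):
            run_length, _ = self.runs[self.run_idx]
            left_in_run = run_length - self.offset
            if remaining < left_in_run:
                # Target lands inside this run: just bump the offset.
                self.offset += remaining
                remaining = 0
            else:
                # Consume the rest of this run and move to the next one.
                remaining -= left_in_run
                self.run_idx += 1
                self.offset = 0
        return num_records_to_skip - remaining

    def read(self) -> int:
        """Return the next record's value and advance by one record."""
        if self.run_idx >= len(self.runs):
            raise EOFError("no more records")
        run_length, value = self.runs[self.run_idx]
        self.offset += 1
        if self.offset == run_length:
            self.run_idx += 1
            self.offset = 0
        return value
```

For example, with runs [(1000, 7), (3, 9), (500, 4)], calling skip(1001)
crosses one run boundary and lands one record into the second run, so the
next read() returns 9. The reader never needed to know where the writer
placed its page or run boundaries.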

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
