On Thu, Nov 19, 2015 at 12:14 PM, Nong Li <[email protected]> wrote:

> There is no ordinal index and even if there was I'm not sure how efficient
> it would be for this case. The use case here is not single row lookups but
> to be able to take advantage of skipping using the column stats.

Sorry, by "index" here I don't mean a B-tree index structure. Rather, I mean the ordinal offset within the Parquet file (i.e. "record #12345").
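To make "ordinal offset" concrete, here is a rough sketch of mapping a record ordinal to a page and an in-page offset. It assumes flat (non-nested) columns, so one value per record, and assumes the reader already has a list of per-page record counts; the `locate` helper and the example page layouts are hypothetical, not part of any Parquet API.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(page_row_counts, ordinal):
    """Return (page_index, offset_within_page) for a 0-based record ordinal.

    page_row_counts is a hypothetical list holding the number of records
    in each page of a single column chunk.
    """
    # Cumulative record count at the end of each page.
    ends = list(accumulate(page_row_counts))
    total = ends[-1] if ends else 0
    if not 0 <= ordinal < total:
        raise IndexError("record ordinal out of range")
    # First page whose cumulative end is strictly greater than the ordinal.
    page = bisect_right(ends, ordinal)
    start = ends[page - 1] if page else 0
    return page, ordinal - start


# Two columns of the same 20,000-record row group, paged differently.
wide_col = [5000, 5000, 5000, 5000]   # poorly compressible: four pages
compressible_col = [20000]            # highly compressible: one large page

print(locate(wide_col, 12345))          # (2, 2345)
print(locate(compressible_col, 12345))  # (0, 12345)
```

The same ordinal resolves independently in each column, whatever that column's page boundaries happen to be.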
If I follow you correctly, you're advocating cross-column page alignment so that you get an equal number of pages even if one of the columns is highly compressible. But that is a writer-side decision, and assuming you've implemented "Skip(int numRecordsToSkip)" on the reader, it seems like the reader doesn't need to know whether pages are aligned. Sure, the skipping might not be as efficient when skipping into the middle of a large page, but especially in the case of highly compressible data (RLE or bit-packing), skipping into the middle of a page is pretty easy and efficient.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
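A minimal sketch of why skipping into the middle of a run-length-encoded page is cheap. It assumes a simplified page representation, a plain list of (value, run_length) pairs, which is not Parquet's actual RLE/bit-packed hybrid layout; `skip_rle` and `read_after_skip` are hypothetical helpers used only for illustration.

```python
def skip_rle(runs, num_records_to_skip):
    """Skip records in a simplified run-length-encoded page.

    runs is a list of (value, run_length) pairs.
    Returns (run_index, offset_in_run) of the first record after the skip.
    The loop visits runs, not individual records, so the cost depends on
    the number of runs rather than the number of records skipped.
    """
    remaining = num_records_to_skip
    for i, (_value, length) in enumerate(runs):
        if remaining < length:
            return i, remaining
        remaining -= length
    if remaining == 0:
        return len(runs), 0  # skipped exactly to the end of the page
    raise ValueError("cannot skip past the end of the page")


def read_after_skip(runs, num_records_to_skip, count):
    """Skip, then materialize up to `count` records from the page."""
    i, offset = skip_rle(runs, num_records_to_skip)
    out = []
    while i < len(runs) and len(out) < count:
        value, length = runs[i]
        take = min(length - offset, count - len(out))
        out.extend([value] * take)
        i, offset = i + 1, 0
    return out


# One large, highly compressible page: 20,000 records in three runs.
runs = [("a", 10000), ("b", 3), ("c", 9997)]

print(skip_rle(runs, 10001))            # (1, 1)
print(read_after_skip(runs, 10001, 4))  # ['b', 'b', 'c', 'c']
```

Skipping 10,001 records here touches only two runs, which is the sense in which a large compressible page does not need to be aligned with the pages of other columns.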
