Hi, I would really appriciate if someone would review PARQUET-1389 <https://issues.apache.org/jira/browse/PARQUET-1389> and PARQUET-1386 <https://issues.apache.org/jira/browse/PARQUET-1386>. After this two outstanding modifications I would be able to merge the whole feature branch column-indexes <https://github.com/apache/parquet-mr/tree/column-indexes>. Also, feel free to comment on any modification on the branch itself.
Any opinions about the improvement idea for writing column indexes only if it would result better filtering? Thanks a lot, Gabor On Tue, Aug 21, 2018 at 9:51 AM Gabor Szadovszky < [email protected]> wrote: > Hi, > > Row alignment in my wording was the 1st definition in Uwe's mail. From > column index based filtering point of view the implementation and the logic > would be much simplier in this case but I do understand that the pages > sizes would not be optimal. It seems, the community is against the row > alignment so I would close this topic. > The 2nd definition is required for column indexes and already mentioned in > the spec. For page v2 we already do the same anyway. > Tim, as you mentioned it would still require skipping values. The biggest > pain in my implementation was to pass the required values through the API > to implement the skipping. The calculation itself was not that complicated. > > Thanks a lot, > Gabor > > On Mon, Aug 20, 2018 at 7:51 PM Tim Armstrong > <[email protected]> wrote: > >> I had a similar concern to Uwe - if there are a large number of columns >> with variable size there does seem to be a real risk of having many tiny >> pages. >> >> I wonder if we could do something in-between where we allow different page >> sizes for different columns, but require that the row ranges for pages of >> different columns either are the same or one contains the other. I.e. if >> you have row ranges [a, b) and [c, d) from two different columns, then >> either they don't overlap (c >= b || a >= d) or one contains the other (c >> >= a && d <= b) || (a >= c && b <= d) >> >> E.g. if you have three columns, small, medium and large that fit 50000, >> 20000 and 1020 values per page, you could meet the above constraint with >> the following set of row ranges where pages are truncated when a page with >> an enclosing row range is full. >> >> Small: [0, 50000), [50000, 100000) >> Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000), >> [70000, >> 90000), [90000, 100000) >> Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19390, 20000), ... >> >> That seems like it would simplify the calculation of the relevant pages on >> the read path, although you would still need to have logic to skip values >> within a page. >> >> >> >> On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <[email protected]> wrote: >> >> > Hello Gabor, >> > >> > comment in-line >> > >> > > The implementation was done based on the original design of column >> > indexes >> > > <https://github.com/apache/parquet-format/blob/master/PageIndex.md> >> > meaning >> > > that no row alignment is required between the pages (the only >> requirement >> > > is for the pages to respect row boundaries). >> > > As we described in the preview parquet sync the desing/implementation >> > would >> > > be much more clear (and might perform a bit better) if the row >> alignment >> > > would also be required. I would be happy to modify the implementation >> if >> > we >> > > would decide to align pages on rows.* I would like to have a final >> > decision >> > > on this topic before merging this feature.* >> > >> > I'm not 100% certain what "row alignment" could mean, I thinking of two >> > very different things. >> > >> > 1. It would mean that all columns in a RowGroup would have the same >> > number of pages that would all align on the same set of rows. >> > 2. It would mean that pages are only split on the highest nesting level, >> > i.e. only split on what would be the horizontal boundaries on a >> 2D-table. >> > I.e. not splitting any cells of this table structure. >> > >> > If the interpretation is 1, then I think this is generating far too much >> > pages for very sparse columns. But I'm guessing that the interpretation >> is >> > rather 2 and there I would be more interested the concerns that were >> raised >> > in the sync. This type of alignment also is something that made me some >> > headaches when implementing things in parquet-cpp. From a Parquet >> > developer's perspective, this would really ease the implementation but >> I'm >> > wondering if there are use-cases where a single cell of a table becomes >> > larger than what we would normally put into a page. >> > >> > Uwe >> > >> >
