Hello Gabor,

comments inline.

> The implementation was done based on the original design of column indexes
> <https://github.com/apache/parquet-format/blob/master/PageIndex.md> meaning
> that no row alignment is required between the pages (the only requirement
> is for the pages to respect row boundaries).
> As we described in the previous parquet sync the design/implementation would
> be much more clear (and might perform a bit better) if the row alignment
> would also be required. I would be happy to modify the implementation if we
> would decide to align pages on rows. *I would like to have a final decision
> on this topic before merging this feature.*

I'm not 100% certain what "row alignment" means here; I can think of two very 
different things.

1. It would mean that all columns in a RowGroup have the same number of 
pages, all aligned on the same set of rows.
2. It would mean that pages are only split on the highest nesting level, i.e. 
only on what would be the horizontal boundaries of a 2D table, so that no 
cell of this table structure is split across pages.
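
To make the two readings concrete, here is a rough Python sketch. The helper 
names and the data layout (rows per page, repetition levels per page) are made 
up for illustration; this is not parquet-mr or parquet-cpp code, only a way to 
state each interpretation as a checkable condition.

```python
from itertools import accumulate


def page_row_boundaries(rows_per_page):
    """Cumulative row offsets at which each page of one column ends."""
    return list(accumulate(rows_per_page))


def aligned_across_columns(columns):
    """Interpretation 1: every column has pages ending on the same rows.

    `columns` maps a (hypothetical) column name to its rows-per-page list.
    """
    boundaries = [page_row_boundaries(pages) for pages in columns.values()]
    return all(b == boundaries[0] for b in boundaries)


def pages_start_on_rows(pages_rep_levels):
    """Interpretation 2: no row (cell) is split across pages.

    `pages_rep_levels` holds one list of repetition levels per page; a page
    starts a new row exactly when its first repetition level is 0.
    """
    return all(
        len(levels) > 0 and levels[0] == 0 for levels in pages_rep_levels
    )


# Interpretation 1 forces a sparse column into as many pages as a dense one.
same_pages = {"dense": [100, 100, 100], "sparse": [100, 100, 100]}
own_pages = {"dense": [100, 100, 100], "sparse": [300]}
print(aligned_across_columns(same_pages), aligned_across_columns(own_pages))

# Interpretation 2 only looks at one column's pages at a time.
whole_rows = [[0, 1, 1, 0], [0, 1]]
split_row = [[0, 1, 1], [1, 0, 1]]
print(pages_start_on_rows(whole_rows), pages_start_on_rows(split_row))
```

Under this sketch, the first check couples all columns of a RowGroup together, 
while the second is a per-column property.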

If the interpretation is 1, then I think this would generate far too many 
pages for very sparse columns. But I'm guessing that the interpretation is 
rather 2, and there I would be more interested in the concerns that were 
raised in the sync. This type of alignment is also something that gave me 
some headaches when implementing things in parquet-cpp. From a Parquet 
developer's perspective, this would really ease the implementation, but I'm 
wondering if there are use-cases where a single cell of a table becomes 
larger than what we would normally put into a page.

Uwe
