Hi,

The implementation of column indexes (writing and filtering) is almost done.
All the implementation work was done under PARQUET-1201
<https://issues.apache.org/jira/browse/PARQUET-1201>. Subtasks were used to
decompose the work.
All changes were made on the separate feature branch column-indexes
<https://github.com/apache/parquet-mr/tree/column-indexes>. Only 2 small
fixes/improvements are *waiting for review* (PARQUET-1389
<https://issues.apache.org/jira/browse/PARQUET-1389> and PARQUET-1386
<https://issues.apache.org/jira/browse/PARQUET-1386>). All the other work
has already been reviewed. After the successful review of the 2 remaining
modifications I would like to merge the feature branch to master. *Any
review/comment related to these modifications or the whole branch is
welcome.*

The implementation was done based on the original design of column indexes
<https://github.com/apache/parquet-format/blob/master/PageIndex.md> meaning
that no row alignment is required between the pages (the only requirement
is for the pages to respect row boundaries).
As we discussed in the previous Parquet sync, the design/implementation would
be much clearer (and might perform a bit better) if row alignment
were also required. I would be happy to modify the implementation if we
decide to align pages on rows. *I would like to have a final decision
on this topic before merging this feature.*

I have an *improvement idea* about handling column indexes when the
data is not sorted. I am curious about your opinions.
If the data is sorted (ascending/descending), column index based
filtering is highly beneficial. If the data is random or similarly unordered,
the column indexes are quite useless. However, partial orderings might be
present where column indexes can still be used.
My idea is to check the characteristics of the min/max values of the column
index before writing, and if it is UNORDERED, to write it only if it seems to
be useful, e.g. based on how much the min-max ranges overlap. If the
overlap is larger than e.g. 90%, then column index based filtering is
not useful most of the time.
Example of a useful UNORDERED column index [min, max]:
[10, 20]
[0, 5]
[15, 25]
[-10, 0]
Example of a ~useless UNORDERED column index:
[0, 20]
[1, 19]
[0, 19]
[1, 21]
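
To make the overlap idea concrete, here is a minimal sketch of one possible
heuristic. It is only an illustration, not what the feature branch does: the
function names, the metric, and the 0.9 threshold are all assumptions. The
metric is the average fraction of the overall [min, max] span that each page
range covers, which roughly estimates the share of pages a point lookup
would still have to read.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (min, max) of one page; hypothetical representation


def mean_range_coverage(pages: List[Range]) -> float:
    """Average fraction of the overall value span covered by each page range.

    Close to 1.0: almost every page range spans the whole value domain, so
    min/max based filtering can rarely skip a page.
    Close to 0.0: page ranges are narrow compared to the whole domain.
    """
    if not pages:
        raise ValueError("at least one page range is required")
    global_min = min(lo for lo, _ in pages)
    global_max = max(hi for _, hi in pages)
    span = global_max - global_min
    if span == 0:
        # Every page holds the same single value; no page can be skipped.
        return 1.0
    return sum((hi - lo) / span for lo, hi in pages) / len(pages)


def should_write_unordered_index(pages: List[Range], threshold: float = 0.9) -> bool:
    """Assumed policy: skip the index when the coverage exceeds the threshold."""
    return mean_range_coverage(pages) <= threshold


useful = [(10, 20), (0, 5), (15, 25), (-10, 0)]
useless = [(0, 20), (1, 19), (0, 19), (1, 21)]

print(should_write_unordered_index(useful))
print(should_write_unordered_index(useless))
```

With this particular metric the first example above has a coverage of 0.25
and the second about 0.92, so only the first index would be written. Other
overlap measures (e.g. pairwise intersection) might be a better fit; this is
just one way to put a number on "how much the ranges overlap".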

Thanks a lot,
Gabor
