Hi,

The implementation of column indexes (writing and filtering) is almost done. All the implementation work was done under PARQUET-1201 <https://issues.apache.org/jira/browse/PARQUET-1201>, with subtasks used to decompose the work. Every change was made on the separate feature branch column-indexes <https://github.com/apache/parquet-mr/tree/column-indexes>. Only 2 small fixes/improvements are *waiting for review* (PARQUET-1389 <https://issues.apache.org/jira/browse/PARQUET-1389> and PARQUET-1386 <https://issues.apache.org/jira/browse/PARQUET-1386>). All the other work has already been reviewed.

After the successful review of the 2 remaining modifications I would like to merge the feature branch to master. *Any review/comment related to these modifications or the whole branch is welcome.*

The implementation is based on the original design of column indexes <https://github.com/apache/parquet-format/blob/master/PageIndex.md>, meaning that no row alignment is required between the pages (the only requirement is that the pages respect row boundaries). As we described in the previous Parquet sync, the design/implementation would be much clearer (and might perform a bit better) if row alignment were also required. I would be happy to modify the implementation if we decide to align pages on rows. *I would like to have a final decision on this topic before merging this feature.*

I have an *improvement idea* about handling column indexes when the data is not sorted. I am curious about your opinions. If the data is sorted (ascending/descending), column index based filtering has high benefits. If the data is random or similarly unordered, the column indexes are quite useless. But partial orderings might be present where column indexes can still be used. My idea is to check the characteristics of the min/max values of the column index before writing and, if it is UNORDERED, write it only if it seems to be useful, e.g. based on how much the min-max ranges overlap. If the overlap is larger than e.g. 90%, then column index based filtering is not useful most of the time.

Example of a useful UNORDERED column index [min, max]:
[10, 20]
[0, 5]
[15, 25]
[-10, 0]

Example of a ~useless UNORDERED column index:
[0, 20]
[1, 19]
[0, 19]
[1, 21]

Thanks a lot,
Gabor
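
P.S. To make the overlap idea a bit more concrete, here is a rough sketch of one possible way to measure it. This is only an illustration, not what is on the branch: the metric (average share of the column's overall min-max span covered by a page), the function names and the 0.9 default are all assumptions, and it only handles numeric min/max values.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (min, max) of one page; numeric values assumed


def average_coverage(pages: List[Range]) -> float:
    """Average share of the column's overall [min, max] span covered by a page.

    A value close to 1.0 means nearly every page spans nearly the whole value
    range, so a lookup can skip almost nothing. (Hypothetical metric.)
    """
    if not pages:
        return 1.0  # nothing to filter on, so treat the index as not useful
    lo = min(p_min for p_min, _ in pages)
    hi = max(p_max for _, p_max in pages)
    span = hi - lo
    if span == 0:
        return 1.0  # every page holds the same single value
    total_width = sum(p_max - p_min for p_min, p_max in pages)
    return total_width / (span * len(pages))


def should_write_unordered_index(pages: List[Range], threshold: float = 0.9) -> bool:
    """Write an UNORDERED column index only if the pages do not overlap too much."""
    return average_coverage(pages) <= threshold


# The two examples from this mail
useful = [(10, 20), (0, 5), (15, 25), (-10, 0)]
useless = [(0, 20), (1, 19), (0, 19), (1, 21)]

print(average_coverage(useful), should_write_unordered_index(useful))
print(average_coverage(useless), should_write_unordered_index(useless))
```

With this particular metric the first example comes out at 0.25 and the second at roughly 0.92, so only the first one would be written with a 90% threshold. Other definitions of "overlap" (e.g. pairwise intersection of the ranges) are equally possible and may behave differently on skewed data.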
