[
https://issues.apache.org/jira/browse/PARQUET-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636703#comment-16636703
]
Gabor Szadovszky commented on PARQUET-1415:
-------------------------------------------
Decide if we need to work on this issue based on the results of the benchmarks.
> Improve logic when to write column indexes
> ------------------------------------------
>
> Key: PARQUET-1415
> URL: https://issues.apache.org/jira/browse/PARQUET-1415
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> Currently, we always write column indexes. If the data is ordered
> (ASCENDING or DESCENDING), filtering benefits greatly from column
> indexes. If the data is UNORDERED, however, it is not obvious whether
> filtering based on column indexes makes sense. For example, if the data is
> random, the min/max values of the different pages might be close to each
> other, so in most cases filtering based on these values would not drop any
> of the pages.
> On the other hand, UNORDERED does not mean that the values are random.
> The values may be clustered or semi-ordered. We should detect these cases
> before writing the column indexes, and write them only if the min/max
> ranges of the pages do not overlap too much.
> Another simple case is when we have only one page. In this case, writing
> column indexes is useless.
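
One possible way to express the "do not overlap too much" check from the
description is sketched below. This is only an illustration, not the parquet-mr
implementation: the class name, the method names, the use of long min/max
values, and the 0.5 threshold are all assumptions made for the example. It
counts how many pairs of pages have overlapping [min, max] ranges and skips
the column index when there is a single page or when too many pairs overlap.

```java
// Hypothetical sketch: none of these names exist in parquet-mr.
final class ColumnIndexHeuristic {

    // Assumed cut-off: write the index only if at most half of the page
    // pairs have overlapping [min, max] ranges. The value is arbitrary.
    static final double MAX_OVERLAP_FRACTION = 0.5;

    // Fraction of page pairs (i, j), i < j, whose closed ranges
    // [mins[i], maxs[i]] and [mins[j], maxs[j]] intersect.
    // Quadratic in the number of pages, which keeps the sketch simple.
    static double overlapFraction(long[] mins, long[] maxs) {
        int n = mins.length;
        if (n < 2) {
            return 0.0; // no pairs to compare
        }
        long overlapping = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (mins[i] <= maxs[j] && mins[j] <= maxs[i]) {
                    overlapping++;
                }
            }
        }
        long pairs = (long) n * (n - 1) / 2;
        return (double) overlapping / pairs;
    }

    // Decide whether a column index looks worth writing for these pages.
    static boolean shouldWriteColumnIndex(long[] mins, long[] maxs) {
        if (mins.length != maxs.length) {
            throw new IllegalArgumentException("mins and maxs differ in length");
        }
        if (mins.length <= 1) {
            return false; // a single page cannot be skipped by filtering
        }
        return overlapFraction(mins, maxs) <= MAX_OVERLAP_FRACTION;
    }
}
```

With clustered pages such as [0, 50], [100, 150], [200, 250], no pair overlaps
and the sketch would keep the index. With pages that all span roughly the same
range, every pair overlaps and the index would be skipped. A real
implementation would likely need a cheaper measure than the pairwise
comparison and would have to handle non-numeric column types.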