[ 
https://issues.apache.org/jira/browse/PARQUET-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636703#comment-16636703
 ] 

Gabor Szadovszky commented on PARQUET-1415:
-------------------------------------------

Decide if we need to work on this issue based on the results of the benchmarks.

> Improve logic when to write column indexes
> ------------------------------------------
>
>                 Key: PARQUET-1415
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1415
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>
> Currently, we always write column indexes. If the data is ordered 
> (ASCENDING or DESCENDING), filtering benefits greatly from column 
> indexes. If the data is UNORDERED, however, it is not obvious whether 
> filtering based on column indexes makes sense. For example, if the data is 
> random, the min/max values of the different pages might be close to each 
> other, so in most cases filtering on these values would not drop any 
> pages. On the other hand, UNORDERED does not mean that the values are 
> random: they can be clustered or semi-ordered. We should detect these 
> cases before writing the column indexes and write them only if the 
> min/max values of the pages do not overlap too much.
> Another simple case is when we have only one page. In this case, writing 
> column indexes is useless. 
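
One possible way to express the check described above is sketched below. It is an illustration only, not parquet-mr code: the function names, the pairwise overlap measure, and the 0.5 threshold are all assumptions chosen for the example, and the benchmarks might well suggest a different criterion.

```python
from itertools import combinations
from typing import Sequence, Tuple

# Hypothetical representation: one (min, max) pair per page of a column chunk.
PageRange = Tuple[int, int]


def overlap_ratio(page_ranges: Sequence[PageRange]) -> float:
    """Fraction of page pairs whose [min, max] intervals intersect."""
    pairs = list(combinations(page_ranges, 2))
    if not pairs:
        # Zero or one page: treat as fully overlapping, nothing to filter.
        return 1.0
    overlapping = sum(
        1
        for (lo_a, hi_a), (lo_b, hi_b) in pairs
        if lo_a <= hi_b and lo_b <= hi_a
    )
    return overlapping / len(pairs)


def should_write_column_index(
    page_ranges: Sequence[PageRange],
    max_overlap: float = 0.5,  # assumed threshold, not taken from the issue
) -> bool:
    """Decide whether a column index is likely to help filtering."""
    if len(page_ranges) < 2:
        # A single page can never be skipped, so the index is useless.
        return False
    return overlap_ratio(page_ranges) <= max_overlap


if __name__ == "__main__":
    ordered = [(0, 9), (10, 19), (20, 29)]
    random_like = [(0, 99), (1, 98), (2, 100)]
    clustered = [(0, 9), (5, 14), (50, 59), (55, 64)]
    for name, ranges in [("ordered", ordered),
                         ("random", random_like),
                         ("clustered", clustered)]:
        print(name, should_write_column_index(ranges))
```

The pairwise comparison is quadratic in the number of pages, which is acceptable for a sketch but would probably need a cheaper estimate (for example, sorting the pages by min value and sweeping) in a real writer.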



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
