[
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084075#comment-17084075
]
Gabor Szadovszky commented on PARQUET-1739:
-------------------------------------------
[~yumwang],
Have you succeeded in implementing the page skipping mechanism in Spark? Without
it you may see only the overhead of the column indexes and not the benefit.
Meanwhile, even if page skipping is implemented, there might be a small
performance degradation when the data is not sorted at all (the min/max
values are very similar across the different pages). In this case the I/O for
reading the column/offset indexes is pure overhead: we cannot drop any pages
based on the min/max values, so we read the same amount of data as we would
without column indexes.
From the column index point of view, we should not see much difference between
the runs if no ppd (predicate pushdown) is used, i.e. no filter is set in the
parquet API.
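
To illustrate the sorted vs. unsorted point, here is a minimal sketch in plain
Python. It is only a toy model, not parquet-mr or Spark code: the helper names
(build_column_index, pages_to_read), the page size of 1000 values and the
filter range are all assumptions made up for this example. It records min/max
per page and counts how many pages a range filter still has to read.

```python
import random

def build_column_index(values, page_size):
    """Split values into pages and record (min, max) for each page."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(page), max(page)) for page in pages]

def pages_to_read(column_index, lo, hi):
    """Return indices of pages whose [min, max] overlaps the filter lo <= x <= hi."""
    return [
        i for i, (page_min, page_max) in enumerate(column_index)
        if page_max >= lo and page_min <= hi
    ]

# Same 10,000 values, once sorted and once shuffled (hypothetical data).
sorted_data = list(range(10_000))
shuffled_data = sorted_data[:]
random.Random(0).shuffle(shuffled_data)

sorted_index = build_column_index(sorted_data, page_size=1000)
shuffled_index = build_column_index(shuffled_data, page_size=1000)

# Filter: 2500 <= x <= 2600
sorted_hits = pages_to_read(sorted_index, 2500, 2600)
shuffled_hits = pages_to_read(shuffled_index, 2500, 2600)

print("sorted:   read", len(sorted_hits), "of", len(sorted_index), "pages")
print("shuffled: read", len(shuffled_hits), "of", len(shuffled_index), "pages")
```

With sorted data only the one page covering the filter range has to be read.
With shuffled data every page's min/max range spans nearly the whole value
domain, so no page can be dropped and the index lookup is only extra work.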
> Make Spark SQL support Column indexes
> -------------------------------------
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 1.11.1
>
>
> Make Spark SQL support Column indexes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)