[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

Gabor Szadovszky (Jira) Wed, 15 Apr 2020 06:20:11 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084075#comment-17084075
 ]


Gabor Szadovszky commented on PARQUET-1739:
-------------------------------------------

[~yumwang],

Have you succeeded in implementing the page skipping mechanism in Spark? Without 
it you may only see the overhead of the column indexes and not the benefit.
Even with page skipping implemented, there might be a small performance 
degradation if the data is not sorted at all (the min/max values are very 
similar across the pages). In that case reading the column/offset indexes is 
extra I/O, while no pages can be dropped based on the min/max values, so we 
read the same amount of data as we would without column indexes.
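
To illustrate the sorted vs. unsorted point, here is a minimal sketch in plain 
Python. It is not the parquet-mr API: build_pages and pages_to_read are 
hypothetical helpers, and the page size, value range and filter range are 
arbitrary assumptions. It only models the idea that a page can be skipped when 
its [min, max] range does not overlap the filter range.

```python
import random

def build_pages(values, page_size):
    """Split values into pages and keep only (min, max) per page,
    roughly what a column index stores for each page (hypothetical model)."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]

def pages_to_read(page_stats, lo, hi):
    """Count pages whose [min, max] range overlaps the filter range [lo, hi].
    Pages that do not overlap could be skipped."""
    return sum(1 for pmin, pmax in page_stats if pmax >= lo and pmin <= hi)

random.seed(0)                      # fixed seed so the run is repeatable
data = list(range(10_000))          # assumed column values

sorted_stats = build_pages(data, 100)

shuffled = data[:]
random.shuffle(shuffled)
unsorted_stats = build_pages(shuffled, 100)

# Assumed filter: 4200 <= value <= 4250
sorted_hits = pages_to_read(sorted_stats, 4200, 4250)
unsorted_hits = pages_to_read(unsorted_stats, 4200, 4250)

print("sorted:  ", sorted_hits, "of", len(sorted_stats), "pages read")
print("unsorted:", unsorted_hits, "of", len(unsorted_stats), "pages read")
```

With sorted data the filter range falls into a single page, so the other pages 
can be skipped. With shuffled data nearly every page spans most of the value 
range, so almost nothing can be skipped and the index reads are pure overhead.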

From the column index point of view, there should not be much difference 
between the runs if no predicate pushdown (ppd) is used, i.e. no filter is set 
in the parquet API.

> Make Spark SQL support Column indexes
> -------------------------------------
>
>                 Key: PARQUET-1739
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1739
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 1.11.1
>
>
> Make Spark SQL support Column indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
