[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

Gabor Szadovszky (Jira) Fri, 27 Mar 2020 01:54:10 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068431#comment-17068431
 ]


Gabor Szadovszky commented on PARQUET-1830:
-------------------------------------------

[~FelixKJose], a vectorized API in parquet-mr has so far only come up as a 
topic in some of our discussions. No effort has been made to design or 
implement it. 
It is unfortunate that both Spark and Hive implemented their own 
vectorization on top of parquet-mr internal APIs (e.g. reading pages directly) 
instead of relying on something common in parquet-mr. To design and implement 
such an API properly, we need design input from our users.

However, to support column indexes in Spark we might have some other approaches:
* As Spark already uses some internal APIs of parquet-mr, it could go one step 
further and reimplement on its side the page skipping mechanism that already 
exists in parquet-mr.
   pros: might be a quicker solution if the Spark community has the resources 
to implement it
   cons: duplicates code and increases the amount of Parquet-related code 
outside of parquet-mr
* Adding a simpler (not vectorized) API to parquet-mr that puts an abstraction 
layer on top of pages (by reading the triplets of value, definition level and 
repetition level from a row group)
   pros: cleaner API in parquet-mr, possibly cleaner code in Spark, and the 
page skipping mechanism introduced by column indexes stays hidden
   cons: the lower-level API could no longer be used (e.g. Spark's own 
vectorized RLE decoder)
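
To make the second option a bit more concrete, here is a rough sketch of the 
idea, written in Python only for brevity. It is not an existing parquet-mr 
API: Triplet, Page, read_triplets and may_match are all hypothetical names, and 
the min/max fields merely stand in for what a column index stores per page. It 
also looks at a single column only and ignores that skipped rows have to stay 
aligned across columns.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, NamedTuple, Optional


class Triplet(NamedTuple):
    """One (value, definition level, repetition level) entry of a column."""
    value: Optional[int]          # None stands for a null at this position
    definition_level: int
    repetition_level: int


@dataclass
class Page:
    """Hypothetical stand-in for a data page plus its column index entry."""
    min_value: int
    max_value: int
    triplets: List[Triplet]


def read_triplets(
    pages: Iterable[Page],
    may_match: Optional[Callable[[int, int], bool]] = None,
) -> Iterator[Triplet]:
    """Yield the triplets of a column chunk without exposing page boundaries.

    If may_match is given, it receives (min, max) of each page, and pages
    that cannot contain a matching value are skipped as a whole.
    """
    for page in pages:
        if may_match is not None and not may_match(page.min_value, page.max_value):
            continue  # page skipping happens here, invisible to the caller
        yield from page.triplets


# Assumed example data: three pages of one row group.
pages = [
    Page(0, 9, [Triplet(3, 1, 0), Triplet(9, 1, 0)]),
    Page(10, 19, [Triplet(12, 1, 0)]),
    Page(20, 29, [Triplet(25, 1, 0), Triplet(None, 0, 0)]),
]

# Predicate "10 <= value <= 25", expressed on page-level min/max.
in_range = lambda lo, hi: hi >= 10 and lo <= 25
values = [t.value for t in read_triplets(pages, in_range)]
```

The caller only sees a stream of triplets, so whether and how pages were 
skipped is an implementation detail of the reader. That is also where the 
downside comes from: a consumer that wants to decode raw page bytes itself 
has no place to plug in.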

What do you think?

> Vectorized API to support Column Index in Apache Spark
> ------------------------------------------------------
>
>                 Key: PARQUET-1830
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1830
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Felix Kizhakkel Jose
>            Priority: Major
>
> As per the comments on https://issues.apache.org/jira/browse/SPARK-26345, it 
> seems that Apache Spark doesn't support Column Index unless we disable 
> vectorizedReader in Spark, which will have other performance implications. 
> As per [~zi], parquet-mr should implement a Vectorized API. Is it already 
> implemented, or is there a pull request for it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
