[ https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068431#comment-17068431 ]
Gabor Szadovszky commented on PARQUET-1830:
-------------------------------------------
[~FelixKJose], the feature of having a vectorized API in parquet-mr was only a
topic in some of our discussions. No efforts have been made to design/implement
it.
It is unfortunate that both Spark and Hive implemented their own vectorization
on top of parquet-mr internal API (e.g. reading pages directly) instead of
relying on something common in parquet-mr. To design and implement such an API
properly we need design input from our users.
However, to support column indexes in Spark we might have some other approaches:
* As Spark already uses some internal API of parquet-mr, we can go a step further
  and re-implement in Spark the page skipping mechanism that already exists in
  parquet-mr.
  pros: might be a quicker solution if the Spark community has resources to
  implement it
  cons: duplicating code, increasing Parquet-related code outside of parquet-mr
* Having a simpler (not vectorized) API in parquet-mr that puts an abstraction
  layer on top of pages (by reading the triplets of value, definition level and
  repetition level from a row group)
  pros: cleaner API in parquet-mr, possibly cleaner code in Spark, hiding the
  page skipping mechanism introduced by column indexes
  cons: lower-level API cannot be used anymore (e.g. Spark's own vectorized
  RLE decoder)
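To make the second option more concrete, here is a rough sketch of what a triplet-level reader hiding page skipping could look like. All type and method names (Triplet, Page, TripletReader, TripletReaderDemo) are hypothetical and do not exist in parquet-mr, and the row range is assumed to come from a column index lookup that is not shown. The sketch skips whole pages only and does not trim rows inside a page that partially overlaps the range.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical types for illustration only; none of these exist in parquet-mr.

/** One (value, definition level, repetition level) triplet of a column. */
final class Triplet {
    final Object value;        // null when the definition level marks the value as absent
    final int definitionLevel;
    final int repetitionLevel;

    Triplet(Object value, int definitionLevel, int repetitionLevel) {
        this.value = value;
        this.definitionLevel = definitionLevel;
        this.repetitionLevel = repetitionLevel;
    }
}

/** A decoded data page together with the row span it covers in the row group. */
final class Page {
    final long firstRowIndex;
    final long rowCount;
    final List<Triplet> triplets;

    Page(long firstRowIndex, long rowCount, List<Triplet> triplets) {
        this.firstRowIndex = firstRowIndex;
        this.rowCount = rowCount;
        this.triplets = triplets;
    }
}

/**
 * Iterates the triplets of one column. Pages that do not overlap the row
 * range [fromRow, toRow) are skipped, so the caller never deals with pages.
 */
final class TripletReader implements Iterator<Triplet> {
    private final Iterator<Page> pages;
    private final long fromRow;
    private final long toRow;
    private Iterator<Triplet> current = Collections.emptyIterator();

    TripletReader(Iterator<Page> pages, long fromRow, long toRow) {
        this.pages = pages;
        this.fromRow = fromRow;
        this.toRow = toRow;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next page that overlaps the requested row range.
        while (!current.hasNext() && pages.hasNext()) {
            Page page = pages.next();
            boolean overlaps = page.firstRowIndex < toRow
                    && page.firstRowIndex + page.rowCount > fromRow;
            if (overlaps) {
                current = page.triplets.iterator();
            }
        }
        return current.hasNext();
    }

    @Override
    public Triplet next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}

final class TripletReaderDemo {
    /** Builds three flat pages of two rows each (values 0..5) and reads a row range. */
    static List<Object> valuesInRange(long fromRow, long toRow) {
        List<Page> pages = new ArrayList<>();
        for (int p = 0; p < 3; p++) {
            List<Triplet> triplets = new ArrayList<>();
            for (int r = 0; r < 2; r++) {
                triplets.add(new Triplet(p * 2 + r, 1, 0)); // flat, non-null column
            }
            pages.add(new Page(p * 2L, 2, triplets));
        }
        List<Object> values = new ArrayList<>();
        TripletReader reader = new TripletReader(pages.iterator(), fromRow, toRow);
        while (reader.hasNext()) {
            values.add(reader.next().value);
        }
        return values;
    }
}
```

With such a reader, a caller like Spark would only see an iterator of triplets. The trade-off is the one listed in the cons: since the pages are hidden, the caller can no longer plug in its own page-level decoders.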
What do you think?
> Vectorized API to support Column Index in Apache Spark
> ------------------------------------------------------
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it
> seems that Apache Spark doesn't support Column Index unless we disable
> vectorizedReader in Spark, which will have other performance implications.
> As per [~zi], parquet-mr should implement a Vectorized API. Is it already
> implemented, or is there any pull request for it?