[
https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davies Liu resolved SPARK-10395.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.6.0
Issue resolved by pull request 8553
[https://github.com/apache/spark/pull/8553]
> Simplify CatalystReadSupport
> ----------------------------
>
> Key: SPARK-10395
> URL: https://issues.apache.org/jira/browse/SPARK-10395
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Minor
> Fix For: 1.6.0
>
>
> The API of Parquet {{ReadSupport}} is somewhat overcomplicated for
> historical reasons. In older versions of parquet-mr (1.6.0rc3 and
> prior), {{ReadSupport}} needed to be instantiated and initialized twice,
> once on the driver side and once on the executor side. The {{init()}}
> method is for driver-side initialization, while {{prepareForRead()}} is
> for the executor side. However, starting from parquet-mr 1.6.0, this is
> no longer the case: {{ReadSupport}} is only instantiated and initialized
> on the executor side. So, theoretically, it is now fine to combine these
> two methods into a single initialization method. The only reason (that I
> can think of) to keep them separate is parquet-mr API backwards
> compatibility.
> For this reason, we no longer need to rely on {{ReadContext}} to pass
> the requested schema from {{init()}} to {{prepareForRead()}}; a private
> {{var}} for the requested schema in {{CatalystReadSupport}} is enough.
> Also, now that the old Parquet support code has been removed, the
> Catalyst requested schema is always set properly when reading Parquet
> files, so all the "fallback" logic in {{CatalystReadSupport}} is now
> redundant.
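The refactoring described above can be illustrated with a minimal Scala sketch. This is not the actual Spark implementation: the class, config key, and {{Schema}} type below are simplified stand-ins for the real Parquet/Catalyst types, showing only how a private {{var}} can replace {{ReadContext}} for passing the requested schema from {{init()}} to {{prepareForRead()}} once both run on the executor side.

```scala
object ReadSupportSketch {
  // Stand-in for Parquet's MessageType / Catalyst's StructType.
  case class Schema(fields: Seq[String])

  class SimpleReadSupport {
    // Requested schema kept in a private var, set once in init(),
    // instead of being threaded through a ReadContext object.
    private var requestedSchema: Schema = _

    // Both instantiation and initialization now happen on the executor
    // side, so one init step can record everything prepareForRead needs.
    def init(conf: Map[String, String]): Unit = {
      val requested = conf.getOrElse("requestedSchema", "")
      requestedSchema = Schema(requested.split(",").toSeq.filter(_.nonEmpty))
    }

    def prepareForRead(): Schema = {
      require(requestedSchema != null, "init() must be called first")
      requestedSchema
    }
  }

  def main(args: Array[String]): Unit = {
    val support = new SimpleReadSupport
    support.init(Map("requestedSchema" -> "a,b,c"))
    println(support.prepareForRead().fields.mkString(","))  // prints a,b,c
  }
}
```

Since the object lives only on the executor, the private {{var}} is never shared across JVMs, which is what makes dropping the {{ReadContext}} hand-off safe.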
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]