[https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383627#comment-14383627]
Cheng Lian commented on SPARK-6566:
-----------------------------------
Hi [[email protected]], as described in SPARK-5463, we do want to
upgrade Parquet. However, currently we have two concerns:
# The most recent Parquet RC release introduces subtle API incompatibilities
related to filter push-down and Parquet metadata gathering (see the sketch after
this list), which I believe will require more work than the patch you provided
if we want everything to work perfectly with the best performance.
# We'd like to wait for the official release of Parquet 1.6.0. This is the
first Parquet release as an Apache top-level project, so it is taking more time
than usual.
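To make the filter push-down part of the first concern concrete, here is a minimal, hypothetical Scala sketch (not taken from either patch) of how a predicate is handed to parquet-hadoop's filter2 API. The package names assume the 1.6.0 RC layout ({{parquet.filter2.predicate}} / {{parquet.hadoop}}), and the object and column names are made up for illustration:
{code}
// Hypothetical example: push an "intCol = 10 AND longCol > 5" predicate down to
// parquet-hadoop so row groups / records can be skipped at read time.
import org.apache.hadoop.conf.Configuration
import parquet.filter2.predicate.FilterApi
import parquet.hadoop.ParquetInputFormat

object FilterPushDownSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Build the predicate with the filter2 builder API (column names are illustrative only).
    val predicate = FilterApi.and(
      FilterApi.eq(FilterApi.intColumn("intCol"), java.lang.Integer.valueOf(10)),
      FilterApi.gt(FilterApi.longColumn("longCol"), java.lang.Long.valueOf(5L)))
    // Register the predicate on the Hadoop configuration consumed by ParquetInputFormat.
    ParquetInputFormat.setFilterPredicate(conf, predicate)
  }
}
{code}
The metadata-gathering side touches {{FilteringParquetRowInputFormat}}, which is where the second diff in the issue description applies.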
We will probably first try to upgrade to the most recent 1.6.0 RC release in
Spark master, and then switch to the official 1.6.0 release in Spark 1.4.0 (and
Spark 1.3.2, if there is one).
> Update Spark to use the latest version of Parquet libraries
> -----------------------------------------------------------
>
> Key: SPARK-6566
> URL: https://issues.apache.org/jira/browse/SPARK-6566
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Konstantin Shaposhnikov
>
> There are a lot of bug fixes in the latest version of Parquet (1.6.0rc7),
> e.g. PARQUET-136.
> It would be good to update Spark to use the latest Parquet version.
> The following changes are required:
> {code}
> diff --git a/pom.xml b/pom.xml
> index 5ad39a9..095b519 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -132,7 +132,7 @@
> <!-- Version used for internal directory structure -->
> <hive.version.short>0.13.1</hive.version.short>
> <derby.version>10.10.1.1</derby.version>
> - <parquet.version>1.6.0rc3</parquet.version>
> + <parquet.version>1.6.0rc7</parquet.version>
> <jblas.version>1.2.3</jblas.version>
> <jetty.version>8.1.14.v20131031</jetty.version>
> <orbit.version>3.0.0.v201112011016</orbit.version>
> {code}
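> As a sanity check (hypothetical snippet, not one of the required changes), the parquet-mr build that actually ends up on the classpath after the bump can be printed via the {{parquet.Version}} class from parquet-common:
> {code}
> // Prints the full parquet-mr version string embedded in the jar,
> // e.g. something like "parquet-mr version 1.6.0rc7 (build ...)".
> object ParquetVersionCheck {
>   def main(args: Array[String]): Unit = {
>     println(parquet.Version.FULL_VERSION)
>   }
> }
> {code}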
> and
> {code}
> --- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> +++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
>      globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
>        mergedMetadata, globalMetaData.getCreatedBy)
>
> -    val readContext = getReadSupport(configuration).init(
> +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
>        new InitContext(configuration,
>          globalMetaData.getKeyValueMetaData,
>          globalMetaData.getSchema))
> {code}
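> For reference, a minimal, hypothetical sketch of the rc7 entry point used above ({{ParquetInputFormat.getReadSupportInstance}}); it assumes spark-sql is on the classpath so that Spark's {{RowReadSupport}} can be loaded, and the object name is made up:
> {code}
> // Illustrative only: obtain a ReadSupport instance via the static accessor exposed
> // by 1.6.0rc7, instead of the getReadSupport call used against 1.6.0rc3.
> import org.apache.hadoop.conf.Configuration
> import parquet.hadoop.ParquetInputFormat
>
> object ReadSupportSketch {
>   def main(args: Array[String]): Unit = {
>     val configuration = new Configuration()
>     // Spark SQL registers its own ReadSupport implementation under parquet.read.support.class.
>     configuration.set(ParquetInputFormat.READ_SUPPORT_CLASS,
>       "org.apache.spark.sql.parquet.RowReadSupport")
>     val readSupport = ParquetInputFormat.getReadSupportInstance(configuration)
>     println(readSupport.getClass.getName)
>   }
> }
> {code}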
> I am happy to prepare a pull request if necessary.