[
https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604631#comment-14604631
]
Apache Spark commented on SPARK-8690:
-------------------------------------
User 'thegiive' has created a pull request for this issue:
https://github.com/apache/spark/pull/7070
> Add a setting to disable SparkSQL parquet schema merge by using datasource
> API
> -------------------------------------------------------------------------------
>
> Key: SPARK-8690
> URL: https://issues.apache.org/jira/browse/SPARK-8690
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.0
> Environment: all
> Reporter: thegiive
> Priority: Minor
>
> We need a general config to disable the parquet schema merge feature.
> Our Spark SQL applications have the following requirements:
> # In Spark 1.1 and 1.2, Spark SQL takes around 1~5 sec to read our Parquet
> files, and we don't want that read time to grow much. We have around 2000
> Parquet files that all share the same schema, so we don't need the schema
> merge feature.
> # We need data source API features such as partition discovery, so we
> cannot use Spark 1.2 or earlier versions.
> # We have a lot of Spark SQL products that use
> *sqlContext.parquetFile(filename)* to read Parquet files. We don't want to
> change the application code; a single setting to disable this feature is
> what we want.
> In 1.4 there are several options, but none of them fully matches our use
> case:
> # Set spark.sql.parquet.useDataSourceApi to false. This meets requirements
> 1 and 3, but it uses the old Parquet API and fails requirement 2.
> # Use sqlContext.load("parquet", Map("path" -> "...", "mergeSchema" ->
> "false")). This meets requirements 1 and 2, but it requires changing a lot
> of our Parquet loading code.
> # Use the default Parquet read path. Spark 1.4 improves schema merging a lot
> over 1.3, but the default behavior still increases our load time from
> 1~5 sec to 100 sec, which fails requirement 1.
> # Try the config from PR 5231. It cannot disable schema merge.
> I think it is better to add a config that disables the data source API's
> schema merge feature. A PR will be provided later.
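
For illustration only, the behavior requested in the description could be sketched as below: a per-read "mergeSchema" option takes precedence, a global setting is used otherwise, and merging stays on when neither is set (the 1.4 default the description complains about). This is a Spark-free sketch under stated assumptions, not the code from the pull request. The global key name and the helper shouldMergeSchema are hypothetical; only the "mergeSchema" read option appears in the issue text.

```scala
object MergeSchemaResolution {
  // Hypothetical global key, chosen for illustration; the issue does not name one.
  val GlobalKey = "spark.sql.parquet.mergeSchema"

  /** Decide whether schema merging should run for a single Parquet read.
    *
    * Precedence: per-read option, then global setting, then the default
    * (merge enabled, as described for the 1.4 data source path).
    */
  def shouldMergeSchema(
      readOptions: Map[String, String],
      globalConf: Map[String, String]): Boolean =
    readOptions.get("mergeSchema")
      .orElse(globalConf.get(GlobalKey))
      .map(_.toBoolean)
      .getOrElse(true)
}
```

With a resolution order like this, existing calls such as sqlContext.parquetFile(filename) would not need to change: setting the global key to "false" once would skip merging everywhere, while an individual read could still opt back in through its own option.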