[jira] [Commented] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API

Apache Spark (JIRA) Sun, 28 Jun 2015 03:47:57 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604631#comment-14604631
 ]


Apache Spark commented on SPARK-8690:
-------------------------------------

User 'thegiive' has created a pull request for this issue:
https://github.com/apache/spark/pull/7070

> Add a setting to disable SparkSQL parquet schema merge by using datasource 
> API 
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-8690
>                 URL: https://issues.apache.org/jira/browse/SPARK-8690
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: all
>            Reporter: thegiive
>            Priority: Minor
>
> We need a general config to disable the parquet schema merge feature. 
> Our SparkSQL application requirements are: 
> # In Spark 1.1 and 1.2, SparkSQL takes around 1~5 sec to read our parquet 
> files. We don't want the read time to increase much. We have around 2000 
> parquet files, all with the same schema, so we don't need the schema merge 
> feature. 
> # We need data source API features such as partition discovery, so we 
> cannot use Spark 1.2 or previous versions. 
> # We have a lot of SparkSQL products. We use 
> *sqlContext.parquetFile(filename)* to read parquet files. We don't want to 
> change the application code. A single setting to disable this feature is 
> what we want. 
> In 1.4, we have several options, but none of them perfectly matches our use 
> case: 
> # Set spark.sql.parquet.useDataSourceApi to false. This meets requirements 
> 1 and 3, but it uses the old parquet API and fails requirement 2. 
> # Use sqlContext.load("parquet", Map("path" -> "...", "mergeSchema" -> 
> "false")). This meets requirements 1 and 2, but it requires changing a lot 
> of the code we use to load parquet. 
> # Spark 1.4 improves schema merge a lot over 1.3, but directly using the 
> default version of parquet increases the load time from 1~5 sec to 100 sec, 
> which fails requirement 1. 
> # Try the config from PR 5231. But it cannot disable schema merge. 
> I think it is better to use a config to disable the data source API's schema 
> merge feature. A PR will be provided later. 
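
For reference, the second option above (passing "mergeSchema" -> "false" through the data source API) could be wrapped roughly as below. This is only a sketch for Spark 1.4, not the change proposed in the pull request. The object name ParquetNoMerge and its two methods are hypothetical helpers invented for illustration. Only sqlContext.load and the "mergeSchema" option come from the description above. Every existing sqlContext.parquetFile(filename) call site would still have to be switched to the helper, which is the code change the reporter wants to avoid.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper, not part of Spark: keeps the "mergeSchema" option in
// one place so call sites do not repeat the options map.
object ParquetNoMerge {

  // Builds the data source options with schema merging turned off.
  def options(path: String): Map[String, String] =
    Map("path" -> path, "mergeSchema" -> "false")

  // Reads parquet through the data source API (so partition discovery is
  // still available) while skipping the schema merge step.
  def read(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.load("parquet", options(path))
}
```

A call site would then change from sqlContext.parquetFile(filename) to ParquetNoMerge.read(sqlContext, filename).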



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

