thegiive created SPARK-8690:
-------------------------------
Summary: Add a setting to disable SparkSQL parquet schema merge by
using datasource API
Key: SPARK-8690
URL: https://issues.apache.org/jira/browse/SPARK-8690
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.4.0
Environment: all
Reporter: thegiive
Priority: Minor
We need a general config to disable the Parquet schema merge feature.
Our Spark SQL application requirements are:
# In Spark 1.1 and 1.2, reading Parquet through Spark SQL took around 1~5 sec.
We don't want the read time to increase much. We have around 2000 Parquet
files, all with the same schema, so we don't need the schema merge feature.
# We need data source API features such as partition discovery, so we cannot
use Spark 1.2 or a previous version.
# We have a lot of Spark SQL products. They use
*sqlContext.parquetFile(filename)* to read Parquet files, and we don't want to
change the application code. A single setting to disable this feature is what
we want.
In 1.4 we have several options, but none of them matches our use case
perfectly:
# Setting spark.sql.parquet.useDataSourceApi to false meets requirements 1 and
3, but it uses the old Parquet API and fails requirement 2.
# Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" ->
"false" )) meets requirements 1 and 2, but it requires changing a lot of the
code we use to load Parquet, so it fails requirement 3.
# Spark 1.4 improves schema merging a lot over 1.3, but using the default
Parquet settings directly still increases the load time from 1~5 sec to 100
sec, which fails requirement 1.
# We tried the config from PR 5231, but it cannot disable schema merge.
I think it is better to add a config that disables the data source API's
schema merge feature. A PR will be provided later.
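For reference, a minimal sketch of what option 2 looks like if the load options
are built in one place and driven by a single setting. The object name and the
key "myapp.parquet.mergeSchema" are hypothetical and are not existing Spark
settings; only the "path" and "mergeSchema" options come from the call quoted
above. The helper is plain Scala, so it runs without Spark on the classpath.

```scala
// Sketch only: the object name and the settings key are hypothetical,
// not an existing Spark API or configuration key.
object ParquetLoadOptions {
  val MergeSchemaKey = "myapp.parquet.mergeSchema"

  // Build the options map intended for sqlContext.load("parquet", options).
  // Schema merging stays enabled unless the setting is explicitly "false".
  def build(path: String, settings: Map[String, String]): Map[String, String] = {
    val merge = settings.getOrElse(MergeSchemaKey, "true").toBoolean
    Map("path" -> path, "mergeSchema" -> merge.toString)
  }
}

// With Spark 1.4 on the classpath, a call site would then look like:
//   sqlContext.load("parquet", ParquetLoadOptions.build(filename, appSettings))
```

Every existing sqlContext.parquetFile(filename) call still has to be edited
once to go through the helper, so this does not satisfy requirement 3. That is
why a config inside Spark itself is the preferred fix.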