Vladimir Picka created SPARK-10659:
--------------------------------------
Summary: SparkSQL saveAsParquetFile does not preserve REQUIRED
(not nullable) flag in schema
Key: SPARK-10659
URL: https://issues.apache.org/jira/browse/SPARK-10659
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0
Reporter: Vladimir Picka
DataFrames currently and automatically promote all Parquet schema fields to
optional (nullable) when they are written to an empty directory. The problem
remains in v1.5.0.
The culprit is this code:
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull to true
  // for the schema of a parquet data.
  val df =
    sqlContext.createDataFrame(
      data.queryExecution.toRdd,
      data.schema.asNullable)
  val createdRelation =
    createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
which was implemented as part of this PR:
https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
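To make the effect of the hack concrete, here is a standalone sketch (plain Python, not Spark code) of what `asNullable` effectively does to a schema: every field, including nested ones, is promoted to nullable, so the REQUIRED flag never reaches the written Parquet file. The tuple-based schema representation is an illustration only, not Spark's actual StructType API.

    # Illustration of the `asNullable` promotion described above.
    # A "schema" here is a list of (name, type, nullable) tuples, where
    # `type` may itself be a nested schema (a list) for struct-like fields.
    def as_nullable(schema):
        """Return a copy of the schema with every field marked nullable."""
        promoted = []
        for name, typ, _nullable in schema:
            if isinstance(typ, list):           # nested struct: recurse
                typ = as_nullable(typ)
            promoted.append((name, typ, True))  # REQUIRED flag is lost here
        return promoted

    schema = [
        ("id", "int64", False),  # REQUIRED in Parquet terms
        ("payload", [("value", "binary", False)], False),
    ]
    print(as_nullable(schema))
    # every nullable flag comes back True, including the nested field

Note that the promotion is irreversible: once the data has been written with optional fields, the original REQUIRED flags cannot be recovered from the file.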
This is very unexpected behaviour for some use cases where files are read from
one place and written to another, such as small-file packing: it produces
incompatible files, because a required field normally cannot be promoted to
optional. It is the essence of a schema that it enforces the "required"
invariant on data, so the flag should be assumed to be intentional.
I believe a better approach is to keep the schema as is by default, and to
provide, e.g., a builder method or option that allows forcing fields to
optional. Right now we have to override a private API so that our files are
rewritten as is, with all the perils that entails.
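The proposal above could be sketched as follows (plain Python illustration; the option name "force_nullable" and the helper are hypothetical, not part of any actual Spark API):

    # Hypothetical write-path behaviour: preserve the schema by default,
    # promote to optional only when the caller explicitly asks for it.
    def prepare_schema_for_write(schema, force_nullable=False):
        """schema: list of (name, type, nullable) tuples."""
        if not force_nullable:
            return schema  # default: REQUIRED fields survive the write
        return [(name, typ, True) for name, typ, _ in schema]

    schema = [("id", "int64", False)]
    print(prepare_schema_for_write(schema))                       # unchanged
    print(prepare_schema_for_write(schema, force_nullable=True))  # promoted

With this default, a read-then-rewrite pipeline would produce files with schemas identical to the input, while callers who rely on today's behaviour could opt back into the promotion explicitly.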
Vladimir
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)