Vladimir Picka created SPARK-10659:
--------------------------------------

             Summary: SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema
                 Key: SPARK-10659
                 URL: https://issues.apache.org/jira/browse/SPARK-10659
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0
            Reporter: Vladimir Picka
DataFrames currently promote all Parquet schema fields to optional (nullable) when they are written to an empty directory. The problem remains in v1.5.0. The culprit is this code:

    val relation = if (doInsertion) {
      // This is a hack. We always set nullable/containsNull/valueContainsNull to true
      // for the schema of a parquet data.
      val df =
        sqlContext.createDataFrame(
          data.queryExecution.toRdd,
          data.schema.asNullable)
      val createdRelation =
        createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
      createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
      createdRelation
    }

which was introduced in this commit: https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b

This is very unexpected behaviour for use cases in which files are read from one place and written to another, such as small-file packing: the rewritten files end up incompatible with the originals, because a REQUIRED field cannot normally be promoted to OPTIONAL. It is the essence of a schema that it enforces a "required" invariant on the data, so it should be assumed that the invariant is intended.

I believe a better approach would be to keep the schema as is by default, and to provide, for example, a builder method or an option for forcing fields to optional. Right now we have to override private APIs so that our files are rewritten as is, with all the perils that entails.

Vladimir

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
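P.S. The nullable promotion that the hack performs can be illustrated without Spark at all. The sketch below is a minimal, hypothetical model: the `ToyType`, `StructField`, and `StructType` names mirror Spark SQL's schema types only in shape, and `asNullable` reproduces the recursive "force everything to nullable" behaviour that `data.schema.asNullable` applies on write. It is an illustration of the reported behaviour, not Spark's actual implementation.

```scala
// Toy stand-ins for Spark SQL's schema types (names chosen to mirror them;
// this is NOT the real org.apache.spark.sql.types API).
sealed trait ToyType
case object IntType extends ToyType
case object StringType extends ToyType
case class ArrayType(element: ToyType, containsNull: Boolean) extends ToyType
case class StructField(name: String, dataType: ToyType, nullable: Boolean)
case class StructType(fields: List[StructField]) extends ToyType

// Recursively set nullable/containsNull to true, mimicking what the
// write path does to every schema. A REQUIRED (nullable = false) field
// comes out OPTIONAL (nullable = true) -- the invariant is silently lost.
def asNullable(t: ToyType): ToyType = t match {
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = asNullable(f.dataType), nullable = true)))
  case ArrayType(elem, _) => ArrayType(asNullable(elem), containsNull = true)
  case other => other
}

val required = StructType(List(StructField("id", IntType, nullable = false)))
val written  = asNullable(required)
println(written)
// StructType(List(StructField(id,IntType,true))) -- REQUIRED flag is gone
```

Rewriting the same files through this path therefore produces a schema that no longer matches the source, which is exactly the incompatibility described above.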