[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian updated SPARK-10659:
-------------------------------
    Description:

DataFrames currently automatically promote all Parquet schema fields to optional when they are written to an empty directory. The problem remains in v1.5.0.

The culprit is this code:

{code}
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull to true
  // for the schema of a parquet data.
  val df = sqlContext.createDataFrame(
    data.queryExecution.toRdd,
    data.schema.asNullable)
  val createdRelation =
    createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
{code}

which was implemented as part of this commit:
https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b

This is very unexpected behaviour for use cases where files are read from one place and written to another, such as small-file packing: it ends up producing incompatible files, because "required" normally cannot be promoted to "optional". It is the essence of a schema that it enforces the "required" invariant on data, so it should be assumed to be intentional.

I believe a better approach is to keep the schema as is by default, and to provide, for example, a builder method or an option that allows forcing fields to optional.

Right now we have to override a private API so that our files are rewritten as is, with all the perils that entails.

Vladimir
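The effect described above can be illustrated with a minimal, self-contained sketch. Note that `Field`, `Schema`, and `NullablePromotionDemo` below are illustrative stand-ins, not Spark's actual `StructType`/`StructField` API; only the `asNullable` behavior they model is taken from the report:

```scala
// Minimal stand-in for a schema field; not Spark's real StructField.
case class Field(name: String, dataType: String, nullable: Boolean)

case class Schema(fields: Seq[Field]) {
  // Models what data.schema.asNullable does in the quoted code: every field
  // is forced to nullable, so a REQUIRED Parquet field comes back OPTIONAL
  // after the write, regardless of the original schema.
  def asNullable: Schema = Schema(fields.map(_.copy(nullable = true)))
}

object NullablePromotionDemo extends App {
  val original = Schema(Seq(
    Field("id", "INT64", nullable = false),   // REQUIRED in Parquet terms
    Field("name", "BINARY", nullable = true)  // OPTIONAL
  ))

  val written = original.asNullable
  println(written.fields.map(f => s"${f.name}: nullable=${f.nullable}").mkString(", "))
  // prints: id: nullable=true, name: nullable=true
  // The REQUIRED invariant on "id" has been silently lost.
}
```

Files rewritten through such a promotion are incompatible with readers that rely on the "required" invariant of the original schema, which is exactly the small-file-packing problem described above.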
> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not
> nullable) flag in schema
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10659
>                 URL: https://issues.apache.org/jira/browse/SPARK-10659
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>            Reporter: Vladimir Picka

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)