[ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10659:
-------------------------------
    Description: 
DataFrames currently automatically promote all Parquet schema fields to 
optional when they are written to an empty directory. The problem remains in 
v1.5.0.
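
For illustration, here is a minimal reproduction sketch (path and values are made up) showing the REQUIRED flag being lost on a Parquet round trip:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// A schema with an explicitly non-nullable (REQUIRED) field.
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1L))), schema)

df.schema("id").nullable                              // false
df.write.parquet("/tmp/required-test")
sqlContext.read.parquet("/tmp/required-test")
  .schema("id").nullable                              // true: promoted to optional
{code}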

The culprit is this code:
{code}
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull to true
  // for the schema of a parquet data.
  val df =
    sqlContext.createDataFrame(
      data.queryExecution.toRdd,
      data.schema.asNullable)
  val createdRelation =
    createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
{code}
which was implemented as part of this PR:
https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
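
In Parquet terms, assuming a single required int64 field for illustration, the effect of this promotion looks roughly like this (schemas as printed by e.g. parquet-tools):
{code}
-- schema of the source files --
message spark_schema {
  required int64 id;
}

-- schema of the files after being rewritten through DataFrames --
message spark_schema {
  optional int64 id;
}
{code}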

This is very unexpected behaviour for use cases where files are read from one 
place and written to another, such as small-file packing: it ends up producing 
incompatible files, because a "required" field cannot normally be promoted to 
"optional". It is the essence of a schema that it enforces the "required" 
invariant on the data, so it should be assumed to be intentional.

I believe a better approach is for the default behaviour to keep the schema as 
is, and to provide e.g. a builder method or option that allows forcing fields 
to optional.
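
To make the intent concrete, a hypothetical writer option could look like this (the option name is made up and not part of Spark), with the default preserving the schema:
{code}
// Hypothetical API sketch, not existing Spark behaviour.
// Default: the REQUIRED/OPTIONAL flags of df.schema are written as is.
df.write.parquet("/data/packed")

// Explicit opt-in to the current behaviour of forcing everything to optional.
df.write.option("forceNullable", "true").parquet("/data/packed")
{code}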

Right now we have to override a private API so that our files are rewritten as 
is, with all the perils that entails.

Vladimir



> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10659
>                 URL: https://issues.apache.org/jira/browse/SPARK-10659
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>            Reporter: Vladimir Picka
>