Cheng Lian created SPARK-10434:
----------------------------------

             Summary: Parquet compatibility with 1.4 is broken when writing 
arrays that may contain nulls
                 Key: SPARK-10434
                 URL: https://issues.apache.org/jira/browse/SPARK-10434
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0
            Reporter: Cheng Lian
            Assignee: Cheng Lian
            Priority: Critical


When writing arrays that may contain nulls, for example:
{noformat}
StructType(
  StructField(
    "f",
    ArrayType(IntegerType, containsNull = true),
    nullable = false) :: Nil)
{noformat}
Spark 1.4 uses the following schema:
{noformat}
message m {
  required group f (LIST) {
    repeated group bag {
      optional int32 array;
    }
  }
}
{noformat}
This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level 
structure and repeated group name "bag" are borrowed from parquet-hive, while 
the innermost element field name "array" is borrowed from parquet-avro.

However, in Spark 1.5, I failed to notice the latter fact and used a purely 
parquet-hive-flavored schema, namely:
{noformat}
message m {
  required group f (LIST) {
    repeated group bag {
      optional int32 array_element;
    }
  }
}
{noformat}
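
The two layouts differ only in the name of the innermost field. A minimal Scala sketch can make that explicit. It uses a hypothetical helper, {{listSchema}}, which is not part of Spark or Parquet and only renders the message text shown above:

```scala
// Hypothetical helper, not a Spark or Parquet API: it only renders the
// schema text from this report, parameterized by the innermost field name.
object ListSchemaDiff {
  def listSchema(elementName: String): String =
    s"""message m {
       |  required group f (LIST) {
       |    repeated group bag {
       |      optional int32 $elementName;
       |    }
       |  }
       |}""".stripMargin

  val spark14: String = listSchema("array")         // layout written by Spark 1.4
  val spark15: String = listSchema("array_element") // layout written by Spark 1.5.0

  // Pairs of lines on which the two schemas disagree.
  def differingLines: Seq[(String, String)] =
    spark14.split("\n").toSeq.zip(spark15.split("\n").toSeq)
      .filter { case (a, b) => a != b }

  def main(args: Array[String]): Unit =
    differingLines.foreach { case (a, b) =>
      println(s"1.4: ${a.trim}  |  1.5: ${b.trim}")
    }
}
```

Running {{main}} should report a single differing line, the one that declares the element field.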
One direct consequence is that Parquet files containing such array fields 
written by Spark 1.5 can't be read correctly by Spark 1.4 (all array elements 
become null).
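
One way to picture why the elements come back as null is the simplified model below. It assumes that the 1.4 reader looks up the element by the field name it expects ("array") and treats a missing optional field as null. The names {{Record}} and {{readElements}} are made up for illustration, and this is not Spark's actual read path:

```scala
// Simplified model, not Spark's read path: each entry of the repeated
// group "bag" is represented as a map from field name to value.
object NameLookupModel {
  type Record = Map[String, Int]

  // Assumption: the reader resolves the element by name, and a name that is
  // absent from the file is surfaced as null (modeled as None here).
  def readElements(bag: Seq[Record], expectedName: String): Seq[Option[Int]] =
    bag.map(_.get(expectedName))

  // Data as laid out by Spark 1.5.0: elements stored under "array_element".
  val writtenBy15: Seq[Record] =
    Seq(1, 2, 3).map(v => Map("array_element" -> v))

  def main(args: Array[String]): Unit = {
    println(readElements(writtenBy15, "array"))         // name expected by 1.4
    println(readElements(writtenBy15, "array_element")) // name expected by 1.5.0
  }
}
```

In this model the first lookup finds no values at all, while the second recovers all three elements, which mirrors the symptom described above.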

To fix this issue, the name of the innermost field should be changed back to 
"array". Note that this fix doesn't affect interoperability with Hive 
(saving Parquet files using {{saveAsTable()}} and then reading them using Hive).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)