Cheng Lian created SPARK-8928:
---------------------------------
Summary: CatalystSchemaConverter doesn't stick to behavior of old
versions of Spark SQL when dealing with LISTs
Key: SPARK-8928
URL: https://issues.apache.org/jira/browse/SPARK-8928
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Spark SQL 1.4.x and prior versions don't follow the Parquet format spec when
dealing with complex types (because the spec wasn't clear about this part at
the time Spark SQL's Parquet support was authored). SPARK-6777 partly fixes this
problem with {{CatalystSchemaConverter}} and introduces a feature flag that
indicates whether we should stick to the Parquet format spec (standard mode) or
to older Spark versions' behavior (compatible mode). However, when dealing with
LISTs (i.e. Spark SQL arrays), {{CatalystSchemaConverter}} doesn't produce exactly
the same schema as before in compatible mode.
For a non-nullable int array, 1.4.x- sticks to the parquet-avro style and gives:
{noformat}
message root {
  <repetition> group <field-name> (LIST) {
    repeated int32 array;
  }
}
{noformat}
The innermost field name is {{array}}. However, {{CatalystSchemaConverter}}
uses {{element}}.
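For comparison, the standard 3-level representation of a LIST per the Parquet format spec (what standard mode targets) names the inner field {{element}}:
{noformat}
<repetition> group <field-name> (LIST) {
  repeated group list {
    optional int32 element;
  }
}
{noformat}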
Similarly, for a nullable int array, 1.4.x- sticks to the parquet-hive style and gives:
{noformat}
optional group <field-name> (LIST) {
  repeated group bag {
    optional int32 array_element;
  }
}
{noformat}
The innermost field name is {{array_element}}. However,
{{CatalystSchemaConverter}} still uses {{element}}.
This issue doesn't affect the Parquet read path, since all the schemas above are
covered by SPARK-6776, but it does affect the write path. Ideally, we'd like
Spark SQL 1.5.x to write Parquet files that stick either to exactly the same
format used in 1.4.x- or to the most recent Parquet format spec.
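To make the expected compatible-mode behavior concrete, here is a minimal sketch of the element-naming rule described above, written as a standalone Python function for illustration only (it is not the actual {{CatalystSchemaConverter}} code, and the function name is hypothetical):

```python
def list_element_field_name(write_legacy_format: bool,
                            nullable_element: bool) -> str:
    """Return the Parquet field name for a LIST's element.

    Standard mode follows the Parquet format spec, which names the
    inner field "element". Compatible (legacy) mode should restore
    1.4.x- behavior: "array" for the 2-level parquet-avro layout
    (non-nullable elements) and "array_element" for the 3-level
    parquet-hive layout (nullable elements).
    """
    if not write_legacy_format:
        return "element"
    return "array_element" if nullable_element else "array"
```

The bug reported here is that, in compatible mode, the converter emits "element" in both legacy layouts instead of the old names.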
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)