[
https://issues.apache.org/jira/browse/SPARK-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-8928:
-----------------------------------
Assignee: Apache Spark (was: Cheng Lian)
> CatalystSchemaConverter doesn't stick to behavior of old versions of Spark
> SQL when dealing with LISTs
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-8928
> URL: https://issues.apache.org/jira/browse/SPARK-8928
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Cheng Lian
> Assignee: Apache Spark
>
> Spark SQL 1.4.x and prior versions don't follow the Parquet format spec when
> dealing with complex types (because the spec wasn't clear about this part at
> the time Spark SQL Parquet support was authored). SPARK-6777 partly fixes
> this problem with {{CatalystSchemaConverter}} and introduces a feature flag
> to indicate whether we should stick to the Parquet format spec (standard mode) or
> the behavior of older Spark versions (compatible mode). However, when dealing with
> LISTs (i.e. Spark SQL arrays), {{CatalystSchemaConverter}} doesn't give
> exactly the same schema as before in compatible mode.
> For a nullable int array, 1.4.x- sticks to parquet-avro and gives:
> {noformat}
> message root {
>   <repetition> group <field-name> (LIST) {
>     required int32 array;
>   }
> }
> {noformat}
> The innermost field name is {{array}}. However, {{CatalystSchemaConverter}}
> uses {{element}}.
> Similarly, for a nullable int array, 1.4.x- sticks to parquet-hive, and gives:
> {noformat}
> optional group <field-name> (LIST) {
>   optional group bag {
>     optional int32 array_element;
>   }
> }
> {noformat}
> The innermost field name is {{array_element}}. However,
> {{CatalystSchemaConverter}} still uses {{element}}.
> This issue doesn't affect the Parquet read path, since all the schemas above are
> covered by SPARK-6776. It does affect the write path, however. Ideally, we'd
> like Spark SQL 1.5.x to write Parquet files sticking to either exactly the
> same format used in 1.4.x- or the most recent Parquet format spec.
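
The naming mismatch described above can be illustrated with a small Python sketch. It is not Spark code: the helper names (avro_style_list, hive_style_list, differing_lines) are hypothetical, the two layouts are copied from the issue description, and the sketch assumes that the only difference in compatible mode is the innermost field name ({{element}} instead of {{array}} or {{array_element}}).

```python
# Hypothetical helpers for illustration only; none of these exist in Spark.
# The group layouts are copied from the issue description above.

def avro_style_list(field_name, leaf="array", repetition="optional"):
    """Render the parquet-avro style LIST layout quoted in the issue."""
    return (
        f"{repetition} group {field_name} (LIST) {{\n"
        f"  required int32 {leaf};\n"
        f"}}"
    )

def hive_style_list(field_name, leaf="array_element"):
    """Render the parquet-hive style LIST layout quoted in the issue."""
    return (
        f"optional group {field_name} (LIST) {{\n"
        f"  optional group bag {{\n"
        f"    optional int32 {leaf};\n"
        f"  }}\n"
        f"}}"
    )

def differing_lines(expected, actual):
    """Return (expected, actual) pairs for every line that differs."""
    pairs = zip(expected.splitlines(), actual.splitlines())
    return [(e.strip(), a.strip()) for e, a in pairs if e != a]

# What 1.4.x- wrote versus the assumed compatible-mode output, which
# differs only in the innermost field name.
legacy_avro = avro_style_list("f")
converted_avro = avro_style_list("f", leaf="element")
legacy_hive = hive_style_list("f")
converted_hive = hive_style_list("f", leaf="element")

print(differing_lines(legacy_avro, converted_avro))
print(differing_lines(legacy_hive, converted_hive))
```

A check of this kind on the write path would flag exactly one differing line per layout, which matches the issue's description that the schemas agree except for the innermost field name.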