[ https://issues.apache.org/jira/browse/SPARK-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619603#comment-14619603 ]

Apache Spark commented on SPARK-8928:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/7304

> CatalystSchemaConverter doesn't stick to behavior of old versions of Spark 
> SQL when dealing with LISTs
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-8928
>                 URL: https://issues.apache.org/jira/browse/SPARK-8928
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>
> Spark SQL 1.4.x and prior versions don't follow the Parquet format spec when 
> dealing with complex types (because the spec wasn't clear about this part at 
> the time Spark SQL's Parquet support was authored). SPARK-6777 partly fixes 
> this problem with {{CatalystSchemaConverter}} and introduces a feature flag 
> to indicate whether we should stick to the Parquet format spec (standard 
> mode) or to the behavior of older Spark versions (compatible mode). However, 
> when dealing with LISTs (i.e. Spark SQL arrays), {{CatalystSchemaConverter}} 
> doesn't produce exactly the same schema as before in compatible mode.
> For a non-nullable int array, 1.4.x- sticks to parquet-avro and gives:
> {noformat}
> message root {
>   <repetition> group <field-name> (LIST) {
>     required int32 array;
>   }
> }
> {noformat}
> The innermost field name is {{array}}. However, {{CatalystSchemaConverter}} 
> uses {{element}}.
> Similarly, for a nullable int array, 1.4.x- sticks to parquet-hive and gives:
> {noformat}
> optional group <field-name> (LIST) {
>   optional group bag {
>     optional int32 array_element;
>   }
> }
> {noformat}
> The innermost field name is {{array_element}}. However, 
> {{CatalystSchemaConverter}} still uses {{element}}.
> This issue doesn't affect the Parquet read path, since all of the schemas 
> above are covered by SPARK-6776. But it does affect the write path. Ideally, 
> we'd like Spark SQL 1.5.x to write Parquet files that stick either to 
> exactly the same format used in 1.4.x- or to the most recent Parquet format 
> spec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
