[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

Ryan Blue (JIRA) Mon, 10 Aug 2015 09:54:06 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680355#comment-14680355
 ]


Ryan Blue commented on SPARK-9340:
----------------------------------

Sorry to jump in late on this issue... I think you're on the right track here, 
but just to be sure I'll clarify things as I see them.

The specs written for PARQUET-113 allow non-LIST/MAP repeated fields because 
that's what parquet-protobuf uses. But we didn't implement support for 
unannotated repeated groups because we wanted to address the compatibility 
issues between Hive, Thrift, and Avro as quickly as possible (which are still 
being cleaned up). So for now, unannotated repeated groups throw the 
AnalysisException noted above. Those should eventually map to required lists of 
required elements to give the exact same view of the data that you have in 
parquet-protobuf.
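
As a rough illustration of that target mapping, here is a minimal Scala sketch. 
It uses toy stand-in types rather than the real Parquet or Spark SQL classes; 
every name in it (PType, DType, RepeatedMapping.toField, and so on) is 
hypothetical and only meant to show the shape of the conversion, not how Spark 
implements or will implement it.

```scala
// Toy stand-ins for Parquet repetition levels (hypothetical, not parquet-mr classes).
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

// Toy Parquet-side types: a primitive column or a group of fields.
sealed trait PType { def name: String; def repetition: Repetition }
case class Primitive(name: String, repetition: Repetition, typeName: String) extends PType
case class Group(name: String, repetition: Repetition, fields: List[PType]) extends PType

// Toy SQL-side types, loosely modelled on array/struct types with nullability flags.
sealed trait DType
case object IntType extends DType
case object StringType extends DType
case class ArrayOf(element: DType, containsNull: Boolean) extends DType
case class StructOf(fields: List[Field]) extends DType
case class Field(name: String, dataType: DType, nullable: Boolean)

object RepeatedMapping {
  // Only two primitives are handled, to keep the sketch short.
  private def primitive(typeName: String): DType = typeName match {
    case "int32"  => IntType
    case "binary" => StringType
    case other    => throw new IllegalArgumentException(s"unsupported primitive: $other")
  }

  // Type of a single value of `t`, ignoring its repetition.
  private def elementType(t: PType): DType = t match {
    case Primitive(_, _, typeName) => primitive(typeName)
    case Group(_, _, fields)       => StructOf(fields.map(toField))
  }

  // An unannotated repeated field becomes a required list of required elements;
  // optional and required fields keep their element type and differ in nullability.
  def toField(t: PType): Field = t.repetition match {
    case Repeated => Field(t.name, ArrayOf(elementType(t), containsNull = false), nullable = false)
    case Optional => Field(t.name, elementType(t), nullable = true)
    case Required => Field(t.name, elementType(t), nullable = false)
  }
}
```

Under these assumptions, `repeated int32 repeated_field` converts to a 
non-nullable array of non-null ints instead of an `optional int32`, and a 
repeated group converts to a non-nullable array of structs.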

I believe [~damianguy] would like to discuss a different mapping from the 
protobuf schema to a parquet schema, which is a great discussion to have in the 
upstream Parquet project. That sounds like a reasonable extension to me, but I 
want to see what the protobuf model maintainers think of it.

> ParquetTypeConverter incorrectly handling of repeated types results in schema 
> mismatch
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-9340
>                 URL: https://issues.apache.org/jira/browse/SPARK-9340
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>            Reporter: Damian Guy
>         Attachments: ParquetTypesConverterTest.scala
>
>
> The way ParquetTypesConverter handles primitive repeated types results in an 
> incompatible schema being used for querying data. For example, given a schema 
> like so:
> message root {
>    repeated int32 repeated_field;
>  }
> Spark produces a read schema like:
> message root {
>    optional int32 repeated_field;
>  }
> These are incompatible and all attempts to read fail.
> In ParquetTypesConverter.toDataType:
>  if (parquetType.isPrimitive) {
>    toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp)
>  } else {...}
> The if condition should also have 
> !parquetType.isRepetition(Repetition.REPEATED)
>  
> And then this case will need to be handled in the else branch.


