[
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679719#comment-14679719
]
Cheng Lian commented on SPARK-9340:
-----------------------------------
Would like to add that, from the perspective of the read path, Spark 1.5 and
parquet-avro 1.7.0+ are two of the most spec-compliant Parquet data model
implementations: both implement all the backwards-compatibility rules defined
in the parquet-format spec, and are able to read legacy Parquet data generated
by various systems. IIRC, no other libraries are capable of doing so for now.
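For example, a minimal read-path sketch with parquet-avro (the file path is
hypothetical, and I'm using the long-standing constructor API here; record
handling is illustrative only):

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetReader

    // The Avro data model applies the backwards-compatibility rules while
    // decoding, so legacy LIST/MAP layouts are read transparently.
    val reader = new AvroParquetReader[GenericRecord](new Path("/tmp/legacy.parquet"))
    try {
      var record = reader.read()  // returns null at end of file
      while (record != null) {
        println(record)
        record = reader.read()
      }
    } finally {
      reader.close()
    }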
As for the write path, parquet-avro 1.8.1+ is the only library I know of that
writes fully standard Parquet data according to the most recent parquet-format
spec. Spark SQL is also refactoring its own Parquet write path in
https://github.com/apache/spark/pull/7679, but that work probably won't be
part of Spark 1.5.
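To illustrate the write path, here's a minimal sketch with parquet-avro 1.8.1
(output path and schema are made up for the example; the list-structure switch
is, I believe, what controls whether the old 2-level or the standard 3-level
layout gets written):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetWriter

    val schema = new Schema.Parser().parse(
      """{"type": "record", "name": "root", "fields": [
        |  {"name": "ints", "type": {"type": "array", "items": "int"}}
        |]}""".stripMargin)

    // Assumption: with this flag set to false, parquet-avro emits the
    // standard 3-level LIST structure from the parquet-format spec.
    val conf = new Configuration()
    conf.setBoolean("parquet.avro.write-old-list-structure", false)

    val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/standard.parquet"))
      .withSchema(schema)
      .withConf(conf)
      .build()

    val record = new GenericData.Record(schema)
    record.put("ints", java.util.Arrays.asList(1, 2, 3))
    writer.write(record)
    writer.close()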
In general, existing Parquet libraries fall into 3 categories:
# Libraries that don't write data in standard Parquet format and don't
implement the backwards-compatibility rules
With these libraries, you will probably hit issues similar to the ones you
encountered when reading LIST and MAP data generated by other systems.
parquet-protobuf 1.4.3 and Spark SQL (<= 1.4) are both in this category, which
is why they don't play well with each other.
# Libraries that don't write data in standard Parquet format, but do implement
the backwards-compatibility rules
With these libraries, you'll be able to read legacy Parquet data generated by
various systems. Although they don't write standard Parquet data, the format
they use is also covered by the backwards-compatibility rules (see the schema
sketch after this list), so it's still fine. Spark 1.5-SNAPSHOT and
parquet-avro 1.7.0 are the only two that I know of in this category up until
now.
# Libraries that write standard Parquet data and implement the
backwards-compatibility rules
These are the best citizens among all Parquet libraries. Unfortunately,
parquet-avro 1.8.1 is the only one that I know of up until now. Hopefully
Spark 1.6 will join this category.
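To make the difference concrete, here is roughly how the same logical list of
ints looks in the legacy 2-level layout (what category 1 and 2 libraries
write; still readable thanks to the compatibility rules) versus the standard
3-level layout from the parquet-format spec (the field name "ints" is made up;
"list" and "element" follow the spec):

    Legacy 2-level layout:
    message root {
      optional group ints (LIST) {
        repeated int32 array;
      }
    }

    Standard 3-level layout:
    message root {
      optional group ints (LIST) {
        repeated group list {
          optional int32 element;
        }
      }
    }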
To summarize, some rules to be aware of if you want true Parquet
interoperability:
- parquet-avro is usually the most spec-compliant Parquet data model out
there, partly because some Parquet committers are also Avro committers (e.g.
Ryan Blue).
- Always use libraries of category 3, or at least category 2, whenever possible.
- If you have to use libraries of category 1, don't use them to read Parquet
data generated by other libraries.
I'll leave this issue open until you confirm that 1.5 solves your problem; if
it doesn't, let's investigate further. In the meantime, I'm removing 1.5.0
from the "Affects Version/s" field.
> ParquetTypesConverter's incorrect handling of repeated types results in
> schema mismatch
> --------------------------------------------------------------------------------------
>
> Key: SPARK-9340
> URL: https://issues.apache.org/jira/browse/SPARK-9340
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
> Reporter: Damian Guy
> Attachments: ParquetTypesConverterTest.scala
>
>
> The way ParquetTypesConverter handles primitive repeated types results in an
> incompatible schema being used for querying data. For example, given a schema
> like so:
>     message root {
>       repeated int32 repeated_field;
>     }
> Spark produces a read schema like:
>     message root {
>       optional int32 repeated_field;
>     }
> These are incompatible and all attempts to read fail.
> In ParquetTypesConverter.toDataType:
>     if (parquetType.isPrimitive) {
>       toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString,
>         isInt96AsTimestamp)
>     } else {...}
> The if condition should also require
>     !parquetType.isRepetition(Repetition.REPEATED)
> and the repeated-primitive case will then need to be handled in the else
> branch.
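> A minimal sketch of how that could look (illustrative only; wrapping the
> repeated primitive in an ArrayType is an assumption, not an actual patch):
>     if (parquetType.isPrimitive && !parquetType.isRepetition(Repetition.REPEATED)) {
>       toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString,
>         isInt96AsTimestamp)
>     } else if (parquetType.isPrimitive) {
>       // Assumption: a repeated primitive at this level denotes an array of
>       // that primitive type, so wrap the converted element type.
>       ArrayType(
>         toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString,
>           isInt96AsTimestamp),
>         containsNull = false)
>     } else {
>       // existing group-type handling ...
>     }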