sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713508125
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
//
// ARRAY<STRUCT<element: STRUCT<element: INT>>>
//
+ //
// Here we try to convert field `list` into a Catalyst type to see whether the converted type
- // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
- // it's case 2.
+ // matches the Catalyst array element type.
+ //
+ // If the guessed element type from the above does not match the Catalyst type (for example,
+ // in case of schema evolution), we need to check if the repeated type matches one of the
+ // backward-compatibility rules for legacy LIST types (see the link above).
+ //
+ // If the element type does not match the Catalyst type and the underlying repeated type
+ // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
val guessedElementType = schemaConverter.convertField(repeatedType)
+ val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())
Review comment:
Yes, that is correct: Spark would still read the legacy format. The issue was
only triggered by schema evolution of a list element. If all of the files have
the same schema, everything should work fine.
I considered using something like "contains" instead of "equals", but I was
concerned that it could cause problems when the schema "contains" the guessed
type but should still be treated as a 3-level LIST. Also, I could not find a
"contains" method for DataType in the codebase. IMHO, it is better to check
Parquet compatibility issues using the Parquet schema rather than the Catalyst
schema, which was meant to reconcile those types anyway.
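To make the rule from the updated code comment concrete, here is a minimal
sketch of the decision it describes. It is not the code in the PR: `ListLayout`,
`Case1`, `Case2`, `ListLayoutSketch`, and `classify` are hypothetical names, and
the two booleans stand in for the results of the type comparison and of
`schemaConverter.isElementType(...)`.

```scala
// Simplified stand-ins for illustration only; none of these are Spark APIs.
sealed trait ListLayout
case object Case1 extends ListLayout // type mismatch and not a legacy LIST
case object Case2 extends ListLayout // every other combination

object ListLayoutSketch {
  // typesMatch: assumed result of comparing guessedElementType with the
  //             Catalyst array element type.
  // isLegacy:   assumed result of the Parquet-schema check for legacy LIST types.
  def classify(typesMatch: Boolean, isLegacy: Boolean): ListLayout =
    if (!typesMatch && !isLegacy) Case1 else Case2
}
```

Under these assumptions, a file whose element schema has evolved (so the types
no longer match) but whose repeated type follows a legacy LIST rule yields
`classify(typesMatch = false, isLegacy = true) == Case2`, whereas the old
comment's rule (mismatch alone means case 1) would have picked case 1.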
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]