[GitHub] [spark] sunchao commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

GitBox Tue, 21 Sep 2021 10:20:05 -0700


sunchao commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713258330




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
       //
       //      ARRAY<STRUCT<element: STRUCT<element: INT>>>
       //
+      //
       // Here we try to convert field `list` into a Catalyst type to see whether the converted type
-      // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
-      // it's case 2.
+      // matches the Catalyst array element type.
+      //
+      // If the guessed element type from the above does not match the Catalyst type (for example,
+      // in case of schema evolution), we need to check if the repeated type matches one of the
+      // backward-compatibility rules for legacy LIST types (see the link above).
+      //
+      // If the element type does not match the Catalyst type and the underlying repeated type
+      // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())

Review comment:
I see. The existing `schemaConverter.convertField(repeatedType)` already covers legacy-format lists, but this particular issue is about schema evolution where new struct fields have been added. I wonder whether it would be better to just expand `equalsIgnoreCompatibleNullability` and allow `element` to _contain_ `guessedElementType`.
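
To make the "contain" idea concrete, here is a minimal sketch of what such a check might look like. It is not Spark's code: `DType`, `IntT`, `StringT`, `ArrayT`, `StructT`, and `TypeContainment.contains` are hypothetical stand-ins for Catalyst types, and nullability is ignored entirely. The assumption is that the read-side element type may carry extra struct fields (added through schema evolution) that the type guessed from the Parquet file lacks.

```scala
// Hypothetical, simplified stand-ins for Catalyst data types (not Spark's classes).
sealed trait DType
case object IntT extends DType
case object StringT extends DType
final case class ArrayT(element: DType) extends DType
final case class StructT(fields: Seq[(String, DType)]) extends DType

object TypeContainment {
  // Returns true when `expected` (the read-side type) contains `actual`
  // (the type guessed from the file): every field of `actual` must exist in
  // `expected` with a containing type; extra fields in `expected` are allowed.
  def contains(expected: DType, actual: DType): Boolean = (expected, actual) match {
    case (ArrayT(e), ArrayT(a)) => contains(e, a)
    case (StructT(ef), StructT(af)) =>
      val byName = ef.toMap
      af.forall { case (name, aType) => byName.get(name).exists(contains(_, aType)) }
    case (e, a) => e == a
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Element type as written in the file, before a field was added.
    val fileElem = StructT(Seq("a" -> IntT))
    // Element type in the evolved read schema, with a new field `b`.
    val readElem = StructT(Seq("a" -> IntT, "b" -> StringT))
    println(TypeContainment.contains(readElem, fileElem)) // true
    println(TypeContainment.contains(fileElem, readElem)) // false
  }
}
```

The check is deliberately one-directional: the read schema may be wider than the file schema, but not the other way around.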




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



