voonhous commented on issue #17968: URL: https://github.com/apache/hudi/issues/17968#issuecomment-3777567187
Root cause is in: [org.apache.hudi.avro.HoodieAvroUtils#recordNeedsRewriteForExtendedAvroTypePromotion](https://github.com/apache/hudi/blob/51e02991201798a98e991c36d8e0204ebccfa8e7/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java#L1366-L1384) Line 1376, when schema fields to compare have different sizes, it returns true, causing a `HoodieAvroParquetReaderIterator` to be built in line 208 below instead of a `ParquetReaderIterator`. https://github.com/apache/hudi/blob/51e02991201798a98e991c36d8e0204ebccfa8e7/hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/hadoop/HoodieAvroParquetReader.java#L184-L213 `HoodieAvroParquetReaderIterator` will perform a rewrite when the `iterator#next` is invoked by it: https://github.com/apache/hudi/blob/51e02991201798a98e991c36d8e0204ebccfa8e7/hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HoodieAvroParquetReaderIterator.java#L41-L43 This happens regardless of whether the field _isNullable={true,false}_. The only reason why this fails when _isNullable=false_ is due to `HoodieAvroParquetReaderIterator` trying to perform a rewrite the record into the following following `promotedSchema` when iterating the `skeletonFile`, i.e. file that only contains the hoodie meta columns. 
`promotedSchema`:

<details>

```json
{
  "type": "record",
  "name": "spark_schema",
  "fields": [
    { "name": "_hoodie_commit_time", "type": ["null", "string"], "doc": "", "default": null },
    { "name": "_hoodie_commit_seqno", "type": ["null", "string"], "doc": "", "default": null },
    { "name": "_hoodie_record_key", "type": ["null", "string"], "doc": "", "default": null },
    { "name": "_hoodie_partition_path", "type": ["null", "string"], "doc": "", "default": null },
    { "name": "_hoodie_file_name", "type": ["null", "string"], "doc": "", "default": null },
    { "name": "timestamp", "type": ["null", "long"], "default": null },
    { "name": "_row_key", "type": ["null", "string"], "default": null },
    { "name": "partition_path", "type": ["null", "string"], "default": null },
    { "name": "rider", "type": ["null", "string"], "default": null },
    { "name": "driver", "type": ["null", "string"], "default": null },
    { "name": "begin_lat", "type": ["null", "double"], "default": null },
    { "name": "begin_lon", "type": ["null", "double"], "default": null },
    { "name": "end_lat", "type": ["null", "double"], "default": null },
    { "name": "end_lon", "type": ["null", "double"], "default": null },
    { "name": "fare", "type": ["null", {
      "type": "record", "name": "fare", "fields": [
        { "name": "amount", "type": ["null", "double"], "default": null },
        { "name": "currency", "type": ["null", "string"], "default": null }
      ]
    }], "default": null },
    { "name": "tip_history", "type": ["null", {
      "type": "array", "items": ["null", {
        "type": "record", "name": "element", "fields": [
          { "name": "amount", "type": ["null", "double"], "default": null },
          { "name": "currency", "type": ["null", "string"], "default": null }
        ]
      }]
    }], "default": null },
    { "name": "_hoodie_is_deleted", "type": ["null", "boolean"], "default": null }
  ]
}
```

</details>

while its actual file schema is:

<details>

```json
[
  { "name": "_hoodie_commit_time", "type": ["null", "string"], "doc": "", "default": null },
  { "name": "_hoodie_commit_seqno", "type": ["null", "string"], "doc": "", "default": null },
  { "name": "_hoodie_record_key", "type": ["null", "string"], "doc": "", "default": null },
  { "name": "_hoodie_partition_path", "type": ["null", "string"], "doc": "", "default": null },
  { "name": "_hoodie_file_name", "type": ["null", "string"], "doc": "", "default": null }
]
```

</details>

As can be seen, when all fields in the `promotedSchema` are nullable, promoting the `skeletonFile`'s data to the `promotedSchema` works, since nulls can simply be placed into the missing fields. But when there are non-nullable fields, this falls apart and an error is thrown.

The crux of the issue: there is no need to perform a rewrite at all when reading `skeletonFile`s, as they always contain the same fixed 5 meta columns, and performing rewrites there is an unnecessary waste of CPU cycles.
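To make the failure mode concrete, here is a minimal sketch (hypothetical names, not Hudi's rewrite code) of what happens when the rewrite must fill a target field that has no counterpart in the 5-column skeleton record: a nullable field can be padded with null, but a non-nullable field has no legal filler, so the rewrite fails.

```java
// Hypothetical sketch: choosing a value for a promotedSchema field that is
// absent from the skeleton record (which only carries _hoodie_* meta columns).
public class SkeletonRewriteSketch {

    static Object fillerForMissingField(String fieldName, boolean isNullable) {
        if (isNullable) {
            return null; // union ["null", ...] -> padding with null is legal
        }
        // Non-nullable field with no source value and no default: no legal
        // value exists, mirroring the error seen when isNullable=false.
        throw new IllegalStateException(
            "Field '" + fieldName + "' is non-nullable but missing from the skeleton record");
    }

    public static void main(String[] args) {
        // Nullable target field: rewrite succeeds by inserting null.
        System.out.println(fillerForMissingField("rider", true));

        // Non-nullable target field: rewrite has nothing valid to insert.
        try {
            fillerForMissingField("rider", false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```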
