voonhous commented on issue #17968:
URL: https://github.com/apache/hudi/issues/17968#issuecomment-3777567187

   Root cause is in:
   
   
[org.apache.hudi.avro.HoodieAvroUtils#recordNeedsRewriteForExtendedAvroTypePromotion](https://github.com/apache/hudi/blob/51e02991201798a98e991c36d8e0204ebccfa8e7/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java#L1366-L1384)
   
   At line 1376, when the two schemas being compared have a different number 
of fields, it returns true, causing a `HoodieAvroParquetReaderIterator` to be 
built at line 208 below instead of a `ParquetReaderIterator`.
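
The field-count branch described above can be modelled in isolation. This is only a minimal sketch, not the Hudi implementation: `FieldCountCheckSketch` and `needsRewrite` are hypothetical names, and schemas are reduced to plain lists of field names.

```java
import java.util.List;

// Simplified stand-in for the check described above; NOT the actual Hudi code.
class FieldCountCheckSketch {

    // Hypothetical helper that models only the field-count branch.
    static boolean needsRewrite(List<String> fileFields, List<String> requestedFields) {
        // A differing number of fields alone is enough to request a rewrite.
        if (fileFields.size() != requestedFields.size()) {
            return true;
        }
        // The real method goes on to compare field types; omitted in this sketch.
        return false;
    }

    public static void main(String[] args) {
        List<String> skeleton = List.of(
            "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
            "_hoodie_partition_path", "_hoodie_file_name");
        List<String> requested = List.of(
            "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
            "_hoodie_partition_path", "_hoodie_file_name", "timestamp", "_row_key");

        System.out.println(needsRewrite(skeleton, requested)); // true: 5 fields vs 7
        System.out.println(needsRewrite(skeleton, skeleton));  // false: same field count
    }
}
```

Under this model, a file holding only the five meta columns always triggers the rewrite path whenever the requested schema also carries data columns.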
   
   
https://github.com/apache/hudi/blob/51e02991201798a98e991c36d8e0204ebccfa8e7/hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/hadoop/HoodieAvroParquetReader.java#L184-L213
   
   
   `HoodieAvroParquetReaderIterator` performs a rewrite whenever its 
`iterator#next` is invoked:
   
   
https://github.com/apache/hudi/blob/51e02991201798a98e991c36d8e0204ebccfa8e7/hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HoodieAvroParquetReaderIterator.java#L41-L43
   
   This happens regardless of whether the field _isNullable={true,false}_. The 
only reason this fails when _isNullable=false_ is that 
`HoodieAvroParquetReaderIterator` tries to rewrite the record into 
the following `promotedSchema` when iterating the `skeletonFile`, 
i.e. the file that only contains the hoodie meta columns.
   
   `promotedSchema`:
   <details>
   ```json
   {
     "type" : "record",
     "name" : "spark_schema",
     "fields" : [ {
       "name" : "_hoodie_commit_time",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_commit_seqno",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_record_key",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_partition_path",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_file_name",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "timestamp",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "_row_key",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "partition_path",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "rider",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "driver",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "begin_lat",
       "type" : [ "null", "double" ],
       "default" : null
     }, {
       "name" : "begin_lon",
       "type" : [ "null", "double" ],
       "default" : null
     }, {
       "name" : "end_lat",
       "type" : [ "null", "double" ],
       "default" : null
     }, {
       "name" : "end_lon",
       "type" : [ "null", "double" ],
       "default" : null
     }, {
       "name" : "fare",
       "type" : [ "null", {
         "type" : "record",
         "name" : "fare",
         "fields" : [ {
           "name" : "amount",
           "type" : [ "null", "double" ],
           "default" : null
         }, {
           "name" : "currency",
           "type" : [ "null", "string" ],
           "default" : null
         } ]
       } ],
       "default" : null
     }, {
       "name" : "tip_history",
       "type" : [ "null", {
         "type" : "array",
         "items" : [ "null", {
           "type" : "record",
           "name" : "element",
           "fields" : [ {
             "name" : "amount",
             "type" : [ "null", "double" ],
             "default" : null
           }, {
             "name" : "currency",
             "type" : [ "null", "string" ],
             "default" : null
           } ]
         } ]
       } ],
       "default" : null
     }, {
       "name" : "_hoodie_is_deleted",
       "type" : [ "null", "boolean" ],
       "default" : null
     } ]
   }
   ```
   </details> 
   
   When its current file schema is actually: 
   <details>
   ```json
   [{
       "name" : "_hoodie_commit_time",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_commit_seqno",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_record_key",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_partition_path",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_file_name",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }]
   ```
   </details>
   
   As can be seen, when the fields of `promotedSchema` are nullable, promoting 
the `skeletonFile`'s data to `promotedSchema` works, as we can just place nulls 
into the missing fields.
   
   But when there are non-nullable fields, this falls apart and an error is 
thrown.
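
The difference between the nullable and non-nullable cases can be sketched with a toy rewrite. This is a simplified model under stated assumptions, not Hudi's rewrite logic: `NullFillRewriteSketch` and `rewrite` are hypothetical names, the target schema is reduced to a map from field name to a nullable flag, and the exception type is chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of rewriting a record into a wider schema; NOT the actual Hudi code.
class NullFillRewriteSketch {

    // targetSchema maps each field name to whether that field is nullable (hypothetical model).
    static Map<String, Object> rewrite(Map<String, Object> record,
                                       Map<String, Boolean> targetSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> field : targetSchema.entrySet()) {
            String name = field.getKey();
            if (record.containsKey(name)) {
                // Field exists in the source record: copy it over.
                out.put(name, record.get(name));
            } else if (field.getValue()) {
                // Missing but nullable: a null can stand in for the value.
                out.put(name, null);
            } else {
                // Missing and non-nullable: nothing valid can be written.
                throw new IllegalStateException("No value for non-nullable field: " + name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> skeletonRecord = Map.of("_hoodie_record_key", "key1");

        Map<String, Boolean> target = new LinkedHashMap<>();
        target.put("_hoodie_record_key", true);
        target.put("rider", true);
        // Succeeds: "rider" is filled with null.
        System.out.println(rewrite(skeletonRecord, target).containsKey("rider"));

        target.put("rider", false);
        try {
            rewrite(skeletonRecord, target);
        } catch (IllegalStateException e) {
            // Fails: "rider" is absent from the skeleton record and cannot be null.
            System.out.println(e.getMessage());
        }
    }
}
```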
   
   The crux of this issue is that there is no need to perform a rewrite when 
reading `skeletonFile`s, as they always have the same fixed 5 columns, and performing 
rewrites is an unnecessary waste of CPU cycles.
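
One way to express that idea is a guard that recognises a schema made up of exactly the five meta columns and skips the rewrite for it. The sketch below is an illustration only, not the actual fix: `SkeletonSchemaSketch`, `isSkeletonSchema` and `shouldRewrite` are hypothetical names, and where such a check would live in Hudi is not stated here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative guard for skipping rewrites on skeleton files; NOT the actual Hudi fix.
class SkeletonSchemaSketch {

    // The five hoodie meta columns listed in the file schema above.
    static final Set<String> META_COLUMNS = Set.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // Hypothetical helper: true when the file holds exactly the meta columns.
    static boolean isSkeletonSchema(List<String> fileFields) {
        return fileFields.size() == META_COLUMNS.size()
            && new HashSet<>(fileFields).equals(META_COLUMNS);
    }

    // Hypothetical helper: request a rewrite only for non-skeleton files.
    static boolean shouldRewrite(List<String> fileFields, boolean promotionRequested) {
        return promotionRequested && !isSkeletonSchema(fileFields);
    }

    public static void main(String[] args) {
        List<String> skeleton = List.of(
            "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
            "_hoodie_partition_path", "_hoodie_file_name");
        List<String> dataFile = List.of("_hoodie_commit_time", "rider", "driver");

        System.out.println(shouldRewrite(skeleton, true)); // false: rewrite skipped
        System.out.println(shouldRewrite(dataFile, true)); // true: rewrite still applies
    }
}
```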


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
