hudi-bot opened a new issue, #17366:
URL: https://github.com/apache/hudi/issues/17366

   Right now, reading the partition columns of a bootstrapped table in Spark is 
handled in HoodieFileGroupReaderBasedParquetFileFormat, not in the file group 
reader layer.  Specifically, when the file group reader is used directly to read 
a file slice by merging bootstrap data and skeleton files, the partition column 
values are null; they are correct only for record keys with updates in the log 
records, where the partition columns are read out directly through the Hudi log 
reader.
   
   Currently, HoodieFileGroupReaderBasedParquetFileFormat applies additional 
projection logic to append the partition values on top of the record iterator 
returned by the file group reader:
   {code:java}
   // Append partition values to rows and project to output schema
   appendPartitionAndProject(
     reader.getClosableIterator,
     requestedSchema,
     remainingPartitionSchema,
     outputSchema,
     fileSliceMapping.getPartitionValues,
     fixedPartitionIndexes)
   {code}
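   To illustrate the general pattern (this is a minimal standalone sketch, not 
Hudi's actual implementation — `appendPartitionValues` and the `Object[]` row 
model are hypothetical stand-ins for the internal `appendPartitionAndProject` 
logic and Spark's `InternalRow`): the wrapper takes the reader's row iterator 
and appends fixed partition values, recovered from the file slice's path, to 
every row it emits.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class AppendPartitionExample {

    // Sketch: wrap a row iterator so each emitted row carries the partition
    // values appended at the end, mirroring what the Parquet file format layer
    // does on top of the file group reader's iterator.
    static Iterator<Object[]> appendPartitionValues(Iterator<Object[]> rows,
                                                    Object[] partitionValues) {
        return new Iterator<Object[]>() {
            @Override
            public boolean hasNext() {
                return rows.hasNext();
            }

            @Override
            public Object[] next() {
                Object[] row = rows.next();
                // Copy the data columns, then append the partition columns.
                Object[] out = Arrays.copyOf(row, row.length + partitionValues.length);
                System.arraycopy(partitionValues, 0, out, row.length, partitionValues.length);
                return out;
            }
        };
    }

    public static void main(String[] args) {
        // Data columns as read from the merged bootstrap/skeleton files;
        // the partition column is absent until the wrapper appends it.
        Iterator<Object[]> dataRows = List.of(
            new Object[]{"key1", 10},
            new Object[]{"key2", 20}).iterator();
        Iterator<Object[]> withPartition =
            appendPartitionValues(dataRows, new Object[]{"2025-01-31"});
        while (withPartition.hasNext()) {
            System.out.println(Arrays.toString(withPartition.next()));
        }
    }
}
```

   Moving this append step into the file group reader itself would make the 
partition values correct for any caller (e.g. compaction), not just the Spark 
Parquet file format path.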
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-8896
   - Type: Sub-task
   - Parent: https://issues.apache.org/jira/browse/HUDI-9108
   - Fix version(s):
     - 1.1.0
   
   
   ---
   
   
   ## Comments
   
   **31/Jan/25 18:17 — yihua:** A few tests in 
[https://github.com/apache/hudi/pull/12490] failed because of this, as 
compaction, which uses the file group reader directly to read a bootstrapped 
file slice, does not write the partition column values properly.

