nsivabalan opened a new pull request, #18826:
URL: https://github.com/apache/hudi/pull/18826

   ### Change Logs
   
   Fixes https://github.com/apache/hudi/issues/18825.
   
   During RLI bootstrap,
   `HoodieBackedTableMetadataWriter#initializeRecordIndexPartition`
   previously persisted the full materialized RDD of RLI records
   (`records.persist("MEMORY_AND_DISK_SER")`) and counted it
   (`records.count()`) purely to obtain the total record count used to size
   the RLI file groups. This is the primary latency and memory bottleneck of
   RLI bootstrap on large tables.
   
   This PR:
   
   - Replaces the persist+count of the RLI records with a direct row-count
     read from each base file's footer metadata via
     `FileFormatUtils.getRowCount(...)`. Footer reads are O(1) per file and
     avoid materializing the record dataset.
   - Reuses the file slices already collected for record-key reading; for
     MOR, base files are taken from those same file slices. Estimation uses
     base file row counts only; log file deltas are bounded and not
     material for sizing.
   - Bypasses estimation entirely when the user pins the RLI file group
     count via `min == max`, and uses the configured value directly.
   - Adds `TestHoodieBackedMetadata#testRecordIndexFileGroupEstimation` and
     `testRecordIndexWithFixedFileGroupCount` covering both COW and MOR.
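
The sizing flow described in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `RliSizingSketch`, `RowCountReader`, and the `recordsPerFileGroup` parameter are hypothetical names introduced here, and the real sizing logic in Hudi may use different inputs (for example byte-based sizing). The `RowCountReader` interface only stands in for a per-file footer lookup such as `FileFormatUtils.getRowCount(...)`, whose exact signature is not shown in this description.

```java
import java.util.List;

// Hypothetical sketch: names and parameters here are illustrative assumptions.
final class RliSizingSketch {

    // Stand-in for a footer-based row-count lookup on a single base file.
    interface RowCountReader {
        long rowCount(String baseFilePath);
    }

    static int estimateFileGroupCount(List<String> baseFilePaths,
                                      RowCountReader reader,
                                      int minFileGroups,
                                      int maxFileGroups,
                                      long recordsPerFileGroup) {
        // When the user pins the count (min == max), skip estimation entirely
        // and do not touch any file footers.
        if (minFileGroups == maxFileGroups) {
            return minFileGroups;
        }

        // Sum row counts from base file footers only; log files are ignored.
        long totalRecords = 0L;
        for (String path : baseFilePaths) {
            totalRecords += reader.rowCount(path);
        }

        // Ceiling division, then clamp into [minFileGroups, maxFileGroups].
        long needed = (totalRecords + recordsPerFileGroup - 1) / recordsPerFileGroup;
        return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
    }

    public static void main(String[] args) {
        // Three base files of 1000 rows each, targeting 800 records per file group.
        int estimated = estimateFileGroupCount(
            List.of("f1.parquet", "f2.parquet", "f3.parquet"),
            path -> 1000L, 1, 10, 800L);
        System.out.println("estimated file groups: " + estimated);
    }
}
```

The point of the sketch is that no record dataset is materialized: the only per-file work is one row-count lookup, and the pinned case does no lookups at all.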
   
   ### Impact
   
   Decouples RLI file group sizing from materializing the record-keys RDD.
   Eliminates the persist+count pass during RLI bootstrap, which is the
   primary latency bottleneck on large tables.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   No new configs or user-facing behavior changes.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   

