prashantwason opened a new pull request, #18029:
URL: https://github.com/apache/hudi/pull/18029

   ### Describe the issue this Pull Request addresses
   
   Closes #18028
   
   Record index bootstrap currently does not validate that the number of 
records written to the metadata table matches the expected record count. A 
mismatch therefore goes undetected and can cause silent data integrity issues.
   
   ### Summary and Changelog
   
   This PR adds validation for record index bootstrap by comparing the expected 
record count with the actual record count stored in the metadata table.
   
   **Changes:**
   - Added `validateRecordIndex` method in `HoodieBackedTableMetadataWriter` to 
validate record counts after bootstrap completes
   - Added `getTotalRecordIndexRecords` method in `HoodieBackedTableMetadata` 
to get the total record count from the base files of the file slices
   - Updated `initializeFilegroupsAndCommitToRecordIndexPartition` to call 
validation after commit when duplicates are not allowed (controlled by 
`hoodie.hfile.writes.allow.duplicates` config)
   - Updated `initializeFilegroupsAndCommitToPartitionedRecordIndexPartition` 
to return the file group count for validation
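   
   The check described above can be sketched roughly as follows. This is a 
minimal, self-contained illustration and not the code in this PR: the class 
`RecordIndexValidationSketch`, the helper `totalRecords`, the method signature, 
and the exception type are all hypothetical, and per-base-file record counts 
are passed in as a plain list instead of being read from the metadata table.

```java
import java.util.List;

// Hypothetical sketch only; names and signatures are assumptions,
// not Hudi's actual classes or methods.
class RecordIndexValidationSketch {

    /** Sums the record counts reported by each base file (one entry per file slice). */
    static long totalRecords(List<Long> recordsPerBaseFile) {
        long total = 0L;
        for (long count : recordsPerBaseFile) {
            total += count;
        }
        return total;
    }

    /**
     * Post-commit check: compares the expected record count with the total
     * found in the base files. Skipped when duplicates are allowed
     * (assumption: counts are not expected to match in that mode).
     */
    static void validateRecordIndex(long expectedRecords,
                                    List<Long> recordsPerBaseFile,
                                    boolean allowDuplicates) {
        if (allowDuplicates) {
            return;
        }
        long actualRecords = totalRecords(recordsPerBaseFile);
        if (actualRecords != expectedRecords) {
            throw new IllegalStateException(String.format(
                "Record index validation failed: expected %d records, found %d in %d base files",
                expectedRecords, actualRecords, recordsPerBaseFile.size()));
        }
    }

    public static void main(String[] args) {
        List<Long> counts = List.of(100L, 250L, 150L);

        // Counts match: returns normally.
        validateRecordIndex(500L, counts, false);

        // Counts differ: the mismatch is surfaced as an exception.
        try {
            validateRecordIndex(501L, counts, false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

   Throwing after the commit keeps the bootstrap write path unchanged and only 
surfaces the mismatch, which matches the post-commit verification described in 
this PR.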
   
   ### Impact
   
   - Improves data integrity for record index bootstrap
   - Validation is enabled by default (when 
`hoodie.hfile.writes.allow.duplicates=false`)
   - If validation fails, bootstrap throws an exception with details about 
the mismatch
   
   ### Risk Level
   
   low - This adds validation logic that runs after the existing bootstrap 
completes. It does not change the bootstrap behavior itself; it only adds a 
post-commit verification step.
   
   ### Documentation Update
   
   none - No new configs are added. The validation uses the existing 
`hoodie.hfile.writes.allow.duplicates` config.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
