prashantwason opened a new pull request, #18029: URL: https://github.com/apache/hudi/pull/18029
### Describe the issue this Pull Request addresses

Closes #18028

When bootstrapping the record index, there is currently no validation to ensure that the expected number of records matches the actual number of records written to the metadata table. This can lead to silent data integrity issues.

### Summary and Changelog

This PR adds validation for record index bootstrap by comparing the expected record count with the actual record count stored in the metadata table.

**Changes:**

- Added `validateRecordIndex` method in `HoodieBackedTableMetadataWriter` to validate record counts after bootstrap completes
- Added `getTotalRecordIndexRecords` method in `HoodieBackedTableMetadata` to get total records from file slice base files
- Updated `initializeFilegroupsAndCommitToRecordIndexPartition` to call validation after commit when duplicates are not allowed (controlled by `hoodie.hfile.writes.allow.duplicates` config)
- Updated `initializeFilegroupsAndCommitToPartitionedRecordIndexPartition` to return file group count for validation

### Impact

- Improves data integrity for record index bootstrap
- Validation is enabled by default (when `hoodie.hfile.writes.allow.duplicates=false`)
- If validation fails, the bootstrap will throw an exception with details about the mismatch

### Risk Level

low - This adds validation logic that runs after the existing bootstrap completes. It does not change the bootstrap behavior itself, only adds a post-commit verification step.

### Documentation Update

none - No new configs are added. The validation uses the existing `hoodie.hfile.writes.allow.duplicates` config.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
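### Illustrative sketch

For readers who want a concrete picture of the post-commit check described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `RecordIndexValidationSketch`, its method names, the per-file count list, and the exception type are all hypothetical stand-ins, and it uses no Hudi APIs. It only shows the general shape of the check: total the record counts reported per file, compare against the expected count, and fail loudly on a mismatch unless duplicates are allowed.

```java
import java.util.List;

// Hypothetical sketch only; names and types are assumptions, not Hudi's API.
final class RecordIndexValidationSketch {

    private RecordIndexValidationSketch() {
    }

    // Sums the record counts reported for each record index base file.
    // Math.addExact makes an overflow fail instead of wrapping silently.
    static long totalRecords(List<Long> recordsPerFile) {
        long total = 0L;
        for (long count : recordsPerFile) {
            total = Math.addExact(total, count);
        }
        return total;
    }

    // Returns true if validation ran and passed, false if it was skipped
    // because duplicates are allowed. Throws on a count mismatch.
    static boolean validateRecordCount(long expectedRecords,
                                       List<Long> recordsPerFile,
                                       boolean allowDuplicates) {
        if (allowDuplicates) {
            // Mirrors the gating described in the PR: the check only runs
            // when duplicates are not allowed.
            return false;
        }
        long actualRecords = totalRecords(recordsPerFile);
        if (actualRecords != expectedRecords) {
            throw new IllegalStateException(
                "Record index validation failed: expected " + expectedRecords
                    + " records but found " + actualRecords);
        }
        return true;
    }
}
```

Called as `RecordIndexValidationSketch.validateRecordCount(10L, List.of(4L, 6L), false)`, the check passes; changing the expected count to `11L` raises the exception, which is the fail-fast behavior the Impact section describes.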
