yihua opened a new pull request, #7517: URL: https://github.com/apache/hudi/pull/7517
### Change Logs When a write transaction writes uncommitted log files in a delta commit, e.g., due to Spark task retries, these log files stay in the file system after the successful delta commit for some time (unlike uncommitted base files, which are deleted based on the markers). The delta commit metadata does not contain these log files, and the metadata table does not contain these entries either. This is a valid case where the metadata-table-based file listing (providing committed data files) is different from the file system (providing committed data files + uncommited log files in this case). In such a case, currently, the metadata table validator throws an exception for the mismatch, because the log blocks are checked based on the commit time, not validated against the commit metadata. This PR fixes the logic of the metadata table validator to check whether the difference in the list of log files between metadata table and direct file system is due to committed log files, based on the commit metadata. ### Impact This PR improves the robustness of the metadata table validator so that it does not fire false alarms for the valid case above. ### Risk level low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
