yihua opened a new pull request, #7517:
URL: https://github.com/apache/hudi/pull/7517

   ### Change Logs
   
   When a write transaction writes uncommitted log files in a delta commit, 
e.g., due to Spark task retries, these log files remain in the file system for 
some time after the delta commit succeeds (unlike uncommitted base files, which 
are deleted based on the markers).  Neither the delta commit metadata nor the 
metadata table contains these log files.  This is a valid case where the 
metadata-table-based file listing (committed data files only) differs from the 
file system listing (committed data files plus uncommitted log files).
   
   In such a case, currently, the metadata table validator throws an exception 
for the mismatch, because the log blocks are checked based on the commit time, 
not validated against the commit metadata.
   
   This PR fixes the logic of the metadata table validator so that it checks, 
based on the commit metadata, whether the difference in the list of log files 
between the metadata table and the direct file system listing is due to 
committed log files.
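   
   As a rough illustration of the check described above, the sketch below 
compares two log file listings and keeps only the differing files that are 
recorded as committed.  It is not the validator's actual code: the class 
`LogFileDiffCheck`, the method `committedMismatches`, and the use of plain 
file-name sets are all assumptions made for this example.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of Hudi's API. */
final class LogFileDiffCheck {
    private LogFileDiffCheck() {}

    /**
     * Returns the log files that appear in only one of the two listings and
     * are also recorded as committed. An empty result means the listings
     * differ only by uncommitted log files, which is the valid case.
     *
     * All three parameters are assumed to be sets of log file names; how they
     * are obtained (metadata table, file system, commit metadata) is out of
     * scope for this sketch.
     */
    static Set<String> committedMismatches(Set<String> metadataTableFiles,
                                           Set<String> fileSystemFiles,
                                           Set<String> committedFiles) {
        // Start from the union of both listings.
        Set<String> diff = new TreeSet<>(metadataTableFiles);
        diff.addAll(fileSystemFiles);

        // Remove the files both listings agree on, leaving the symmetric difference.
        Set<String> common = new TreeSet<>(metadataTableFiles);
        common.retainAll(fileSystemFiles);
        diff.removeAll(common);

        // Only committed files in the difference indicate a real mismatch.
        diff.retainAll(committedFiles);
        return diff;
    }
}
```

   With a metadata table listing of `{log-a}`, a file system listing of 
`{log-a, log-b}`, and a committed set of `{log-a}`, the result is empty, so the 
extra `log-b` is treated as an uncommitted leftover rather than a validation 
failure.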
   
   ### Impact
   
   This PR improves the robustness of the metadata table validator so that it 
does not fire false alarms for the valid case above.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

