vinishjail97 opened a new pull request, #10336:
URL: https://github.com/apache/hudi/pull/10336

   ### Change Logs
   
   Fix a bug in the checkpointing logic for S3/GCS in the empty-dataset use 
case. The sequence that triggered the bug is as follows.
   
   The 1st delta commit processed 3 files and produced the following checkpoint.
   ```
   23/12/06 16:55:26 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Querying S3 with:00000000000000, 
queryInfo:Query information for Incremental Source queryType: snapshot, 
previousInstant: 00000000000000, startInstant: 00000000000000, endInstant: 
20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, 
limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, 
s3.object.key]
        
   
   23/12/06 16:55:42 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Adjusting end checkpoint:20231206150423946 
based on sourceLimit :300000000
        
   23/12/06 16:55:46 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Adjusted end checkpoint 
:20231206150423946#ee-facts/0012_part_00.parquet
        
   
   23/12/06 16:55:49 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Total number of files to process :3
   ```
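   
   The adjusted end checkpoint in the log above has the form 
`<commit time>#<object key>`. As a minimal sketch of how such a string can be 
split, assuming `#` is the only separator (the class and method names here are 
hypothetical and are not taken from the Hudi code base):
   
   ```java
   import java.util.Optional;

   // Hypothetical helper, for illustration only.
   final class CheckpointParts {
       final String commitTime;
       final Optional<String> key;

       private CheckpointParts(String commitTime, Optional<String> key) {
           this.commitTime = commitTime;
           this.key = key;
       }

       // Splits "commitTime#key" at the first '#'; a bare commit time has no key.
       static CheckpointParts parse(String checkpoint) {
           int sep = checkpoint.indexOf('#');
           if (sep < 0) {
               return new CheckpointParts(checkpoint, Optional.empty());
           }
           return new CheckpointParts(
               checkpoint.substring(0, sep),
               Optional.of(checkpoint.substring(sep + 1)));
       }
   }
   ```
   
   For `20231206150423946#ee-facts/0012_part_00.parquet`, this yields the commit 
time `20231206150423946` and the key `ee-facts/0012_part_00.parquet`.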
   
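   The 2nd delta commit was empty, and the checkpoint it returned was 
`20231206150423946`. This is not a valid checkpoint progression: a new 
checkpoint must be equal to or greater than the previous one in lexicographical 
order, but `20231206150423946` sorts before the previous checkpoint 
`20231206150423946#ee-facts/0012_part_00.parquet`.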
   ```
   23/12/06 16:59:52 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Querying S3 
with:20231206150423946#ee-facts/0012_part_00.parquet, queryInfo:Query 
information for Incremental Source queryType: incremental, previousInstant: 
00000000000000, startInstant: 20231206150423946, endInstant: 20231206150423946, 
orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: 
s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]
        
   23/12/06 16:59:53 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Adjusting end checkpoint:20231206150423946 
based on sourceLimit :300000000
        
   23/12/06 16:59:55 INFO S3EventsHoodieIncrSource 
{database.table=keywords.ee_facts} : Empty source, returning 
endpoint:20231206150423946
   ```
   
   Because the 2nd commit's checkpoint was faulty, the 3rd commit read the same 
set of files again and wrote duplicate data.
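   
   The progression rule described above can be sketched as a plain 
lexicographical comparison. This is an illustration under stated assumptions, 
not necessarily the change made in this PR: the class and method names are 
hypothetical, and falling back to the previous checkpoint on an empty batch is 
one possible guard.
   
   ```java
   // Hypothetical helper, for illustration only.
   final class CheckpointProgression {

       // A checkpoint may stay the same or move forward in lexicographical order.
       static boolean isValidProgression(String previous, String next) {
           return next.compareTo(previous) >= 0;
       }

       // Assumed guard for an empty batch: never return a checkpoint that
       // sorts before the previous one.
       static String checkpointForEmptyBatch(String previous, String candidateEnd) {
           return isValidProgression(previous, candidateEnd) ? candidateEnd : previous;
       }
   }
   ```
   
   With the values from the logs, `isValidProgression` returns `false` for the 
previous checkpoint `20231206150423946#ee-facts/0012_part_00.parquet` and the 
returned value `20231206150423946`, because a string sorts after any of its 
proper prefixes.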
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low, medium or high below)
   Medium
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   None, this is a bug fix for an existing feature.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
