vinishjail97 opened a new pull request, #10336:
URL: https://github.com/apache/hudi/pull/10336
### Change Logs
Fix a bug in the checkpointing logic for S3/GCS sources in the empty-dataset
use case. The bug arose as follows.
The 1st delta commit processed 3 files and produced the checkpoint
`20231206150423946#ee-facts/0012_part_00.parquet`.
```
23/12/06 16:55:26 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Querying S3 with:00000000000000,
queryInfo:Query information for Incremental Source queryType: snapshot,
previousInstant: 00000000000000, startInstant: 00000000000000, endInstant:
20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key,
limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time,
s3.object.key]
23/12/06 16:55:42 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Adjusting end checkpoint:20231206150423946
based on sourceLimit :300000000
23/12/06 16:55:46 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Adjusted end checkpoint
:20231206150423946#ee-facts/0012_part_00.parquet
23/12/06 16:55:49 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Total number of files to process :3
```
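The 2nd delta commit was empty, and the checkpoint it returned was
`20231206150423946`. This is not a valid checkpoint progression: a checkpoint
should either stay equal or increase monotonically (in lexicographical order),
but the bare instant sorts before the previous checkpoint
`20231206150423946#ee-facts/0012_part_00.parquet`.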
```
23/12/06 16:59:52 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Querying S3
with:20231206150423946#ee-facts/0012_part_00.parquet, queryInfo:Query
information for Incremental Source queryType: incremental, previousInstant:
00000000000000, startInstant: 20231206150423946, endInstant: 20231206150423946,
orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn:
s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]
23/12/06 16:59:53 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Adjusting end checkpoint:20231206150423946
based on sourceLimit :300000000
23/12/06 16:59:55 INFO S3EventsHoodieIncrSource
{database.table=keywords.ee_facts} : Empty source, returning
endpoint:20231206150423946
```
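The regression can be checked with a plain string comparison: the bare instant is a strict prefix of the previous checkpoint, so it sorts before it. The snippet below is a minimal sketch of that check. `is_valid_progression` is a hypothetical helper written for illustration and is not part of Hudi.

```python
def is_valid_progression(previous: str, current: str) -> bool:
    """Return True if `current` equals or sorts lexicographically after `previous`.

    Hypothetical helper for illustration only; it is not a Hudi API.
    """
    return current >= previous


# Checkpoints taken from the logs above.
initial = "00000000000000"
first_commit = "20231206150423946#ee-facts/0012_part_00.parquet"
second_commit = "20231206150423946"  # returned by the empty 2nd delta commit

# 1st commit: the checkpoint moves forward from the initial instant.
print(is_valid_progression(initial, first_commit))
# 2nd commit: the bare instant is a prefix of the previous checkpoint,
# so it compares as smaller and the progression is invalid.
print(is_valid_progression(first_commit, second_commit))
```

Returning the previous checkpoint unchanged for an empty batch would satisfy this check, since equality counts as a valid progression.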
Because the 2nd commit's checkpoint was faulty, the 3rd commit read the same
set of files again and wrote duplicate data.
### Impact
_Describe any public API or user-facing feature change or any performance
impact._
### Risk level (write none, low medium or high below)
Medium
### Documentation Update
_Describe any necessary documentation update if there is any new feature,
config, or user-facing change_
None, this is a bug fix for an existing feature.
### Contributor's checklist
- [x] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed