whocanhu commented on PR #10898:
URL: https://github.com/apache/hudi/pull/10898#issuecomment-2081438412
> > > > We also hit this issue in our tests. We use a simple bucket index
> > > > with a MOR table without partitions. After we restart the job, it
> > > > writes successfully once, and then the bucket ids conflict:
> > > >
> > > > 00000074-3413-4e9e-b4dd-4676e4eeccb4-0_74-8-3151_20240426173957249.parquet
> > > > (generated by bulk insert)
> > > > 00000074-50bc-4b34-82c5-08c210d82d33-0_74-26-37004_20240428145515030.parquet
> > > > (generated by a deltacommit after the restart)
> > > >
> > > > According to the driver log, the restarted job read a rollback commit,
> > > > and it seems the timeline did not load all the buckets completely:
> > > >
> > > > 24/04/28 14:57:32 INFO HoodieBucketIndex: Get BucketIndexLocationMapper for partitions: []
> > > > 24/04/28 14:57:32 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20240428145515193__rollback__COMPLETED__20240428145515760]}
> > >
> > >
> > > Yes, it is a serious problem that would block users' business. I would
> > > raise it so that the error instant can be found first, which can help
> > > users continue their business. @danny0405
> >
> >
> > Actually, in my use case there are no multiple writers; a restart simply
> > triggered a rollback, and then we hit this issue.
>
> At least getting the error instant can help you continue your job.
My use case is CDC, so the data must stay complete with respect to the upstream
DB table. The deltacommit that introduced the conflicting bucket is already
completed, so we can't just delete the old one and let the job continue. The
ideal way is to find the root cause and prevent the conflicting bucket from
being created.

According to my simple analysis, the cause is an issue in the timeline reload.
We used our internal consistent hashing for a long time without this issue, but
we hit it with the bucket index. Both consistent hashing and the bucket index
invoke getLatestFileSlicesForPartition, so I think this may be caused by
reloadActiveTimeline on restart.
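
For anyone who wants to check whether their table shows the same symptom, here is a minimal sketch that scans base file names for bucket ids owned by more than one file id. It is not part of Hudi: the helper names `parse_base_file` and `find_bucket_conflicts` are hypothetical, and it assumes the `<fileId>_<writeToken>_<instantTime>.parquet` layout seen in the two files quoted above, with the bucket id taken as the first 8 digits of the file id. Log files use a different naming pattern and are not handled here.

```python
from collections import defaultdict


def parse_base_file(name):
    """Split '<fileId>_<writeToken>_<instantTime>.<ext>' into its three parts.

    Assumption: this layout is inferred from the file names in this thread.
    """
    stem, _, _ext = name.rpartition(".")
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time


def find_bucket_conflicts(file_names):
    """Return {bucket_id: sorted file ids} for buckets with more than one file id."""
    file_ids_by_bucket = defaultdict(set)
    for name in file_names:
        file_id, _, _ = parse_base_file(name)
        # Assumption: the bucket id is the zero-padded 8-digit prefix of the file id.
        bucket_id = int(file_id[:8])
        file_ids_by_bucket[bucket_id].add(file_id)
    return {
        bucket: sorted(ids)
        for bucket, ids in file_ids_by_bucket.items()
        if len(ids) > 1
    }


# The two base files reported in this thread.
files = [
    "00000074-3413-4e9e-b4dd-4676e4eeccb4-0_74-8-3151_20240426173957249.parquet",
    "00000074-50bc-4b34-82c5-08c210d82d33-0_74-26-37004_20240428145515030.parquet",
]
conflicts = find_bucket_conflicts(files)
for bucket, ids in conflicts.items():
    print(f"bucket {bucket} is owned by {len(ids)} file ids: {ids}")
```

Running this over a listing of the table's base files only reports the conflict; it does not say which instant created the second file id, so the instant time in each file name still has to be matched against the timeline by hand.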

