xuzifu666 commented on PR #10898:
URL: https://github.com/apache/hudi/pull/10898#issuecomment-2030921118
> @xuzifu666 @danny0405 @beyond1920 I think we should solve the root cause of bucket duplication. There are currently three situations where bucket file duplication occurs:
>
> 1. Spark speculative execution. Turning off speculative execution solves this problem.
> 2. The Hudi archiver deleting the completed timeline in parallel. This has been solved in 1.0.
> 3. Concurrent insert overwrite from multiple Spark writers. This is a bug that needs to be fixed.
>
> Now focus on scenario 3: concurrent insert overwrite from multiple Spark writers. When Hudi builds file slices, it calls isFileSliceCommitted to determine whether the current file is committed.
>
> ```java
> /**
>  * A FileSlice is considered committed if one of the following is true:
>  *   - there is a committed data file, or
>  *   - there are some log files that are based off a commit or delta commit.
>  */
> private boolean isFileSliceCommitted(FileSlice slice) {
>   if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
>     return false;
>   }
>   return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
> }
> ```
>
> This is OK for the single-writer scenario, but for multi-writer the logic of isFileSliceCommitted has a bug: if a file has a commit time smaller than the smallest completed commit, Hudi directly determines that the file is committed, even if it is a garbage file (i.e., a file produced by a failed write).
>
> E.g., two Spark apps insert overwrite the same partition of a Hudi BUCKET table:
> app1: starts a write commit at 00000001, writing file 000000000--uuid1.parquet
> app2: starts a write commit at 00000002, writing file 000000000--uuid2.parquet
>
> app1 may fail to write due to OCC conflict / cancellation / OOM, but 000000000--uuid1.parquet has already been written. When Hudi builds file slices, 000000000--uuid1.parquet is considered committed, since its commit time 00000001 < the smallest completed commit 00000002. This is wrong: commit 00000001 was never committed. Maybe we can modify isFileSliceCommitted like this:
>
> ```java
> private boolean isFileSliceCommitted(FileSlice slice) {
>   if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
>     return false;
>   }
>   return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime())
>       && incompleteTimelineNotContains(slice.getBaseInstantTime());
> }
> ```
>
> Finally, I think Hudi's file slices should be managed uniformly, just like Iceberg/Delta Lake, rather than being obtained through list operations.

Thanks for your advice. I have tested it in multiple-writer scenarios and it works as expected @xiarixiaoyao
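The failure mode described above can be sketched with a minimal, self-contained Java model. The `Set`s here are toy stand-ins for Hudi's timeline, and the class, method, and field names (`isCommittedOriginal`, `isCommittedFixed`, `incompleteInstants`, etc.) are illustrative assumptions, not Hudi's actual API:

```java
import java.util.Set;

public class FileSliceCommitSketch {
  // app2's commit 00000002 completed; app1's commit 00000001 failed mid-write
  // and remains incomplete, but its data file is already on storage.
  static final Set<String> completedInstants = Set.of("00000002");
  static final Set<String> incompleteInstants = Set.of("00000001");
  static final String firstCompleted = "00000002";
  static final String lastCompleted = "00000002";

  // Toy analogue of timeline.containsOrBeforeTimelineStarts(): true if the
  // instant is completed, or sorts before the earliest completed instant.
  static boolean containsOrBeforeTimelineStarts(String instant) {
    return completedInstants.contains(instant)
        || instant.compareTo(firstCompleted) < 0;
  }

  // Original check: 00000001 sorts before the timeline start 00000002, so the
  // failed write's file is wrongly treated as committed.
  static boolean isCommittedOriginal(String baseInstant) {
    return baseInstant.compareTo(lastCompleted) <= 0
        && containsOrBeforeTimelineStarts(baseInstant);
  }

  // Proposed fix: additionally require that no incomplete (inflight/failed)
  // instant matches the slice's base instant time.
  static boolean isCommittedFixed(String baseInstant) {
    return isCommittedOriginal(baseInstant)
        && !incompleteInstants.contains(baseInstant);
  }

  public static void main(String[] args) {
    System.out.println("00000001 original: " + isCommittedOriginal("00000001")); // true (the bug)
    System.out.println("00000001 fixed:    " + isCommittedFixed("00000001"));    // false
    System.out.println("00000002 fixed:    " + isCommittedFixed("00000002"));    // true
  }
}
```

With the extra incomplete-timeline check, app1's orphaned 00000001 file is excluded from file-slice building while app2's completed 00000002 file is still recognized.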
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]