xuzifu666 commented on PR #10898:
URL: https://github.com/apache/hudi/pull/10898#issuecomment-2030921118

   > @xuzifu666 @danny0405 @beyond1920 I think we should solve the root cause 
of bucket file duplication. There are currently three situations where bucket 
file duplication occurs:
   > 
   > 1. Spark speculative execution. Turning off speculative execution solves this problem.
   > 2. The Hudi archiver deleting the completed timeline in parallel. Hudi 1.0 has solved this problem.
   > 3. Concurrent insert overwrite from multiple Spark writers. This is a bug that needs to be fixed.
   > 
   > Now focus on scenario 3: concurrent insert overwrite from multiple Spark 
writers. When Hudi builds a file slice, it calls isFileSliceCommitted to 
determine whether the current file is committed:
   > 
   > ```java
   >   /**
   >    * A FileSlice is considered committed if one of the following is true:
   >    *  - There is a committed data file.
   >    *  - There are some log files that are based off a commit or delta commit.
   >    */
   >   private boolean isFileSliceCommitted(FileSlice slice) {
   >     if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
   >       return false;
   >     }
   > 
   >     return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
   >   }
   > ```
   > 
   > This is fine for a single-writer scenario, but with multiple writers the 
logic of isFileSliceCommitted has a bug: if a file has a commit time smaller 
than the smallest completed commit, Hudi will directly determine that the file 
is committed, even if it is a garbage file or a file left behind by a failed write.
   > 
   > For example, two Spark apps insert overwrite the same partition of a Hudi BUCKET table:
   > 
   > - app1 starts a write commit at 00000001 and writes file 000000000--uuid1.parquet
   > - app2 starts a write commit at 00000002 and writes file 000000000--uuid2.parquet
   > 
   > app1 may fail to complete its write due to an OCC conflict, cancellation, or OOM, 
but 000000000--uuid1.parquet has already been written. When Hudi builds the 
file slice, 000000000--uuid1.parquet is considered committed, since its commit 
time 00000001 < the smallest completed commit 00000002. This is wrong: commit 
00000001 was never committed. Maybe we can modify isFileSliceCommitted like this:
   > 
   > ```java
   >   private boolean isFileSliceCommitted(FileSlice slice) {
   >     if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
   >       return false;
   >     }
   > 
   >     return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime())
   >         && UncompleteTimelineNotContains(slice.getBaseInstantTime());
   >   }
   > ```
   > 
   > Finally, I think Hudi's file slices should be managed uniformly, just like 
Iceberg/Delta Lake, rather than being discovered through file listing.
   
   Thanks for your advice. I tested it in multiple-writer scenarios and it 
works as expected @xiarixiaoyao 
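The extra guard proposed in the quoted comment can be illustrated with a small, self-contained sketch. The timeline model, class name, and helper below are simplified assumptions for illustration only, not Hudi's actual API: completed and inflight instants are modeled as plain string sets, and the inflight check plays the role of the proposed `UncompleteTimelineNotContains`.

```java
import java.util.Set;
import java.util.TreeSet;

// Simplified sketch (not Hudi's actual API): a file slice's base instant is
// considered committed only if it is in (or before the start of) the completed
// timeline AND is not still pending/inflight.
public class FileSliceCommitCheck {
  private final TreeSet<String> completedInstants;
  private final Set<String> inflightInstants;

  public FileSliceCommitCheck(Set<String> completed, Set<String> inflight) {
    this.completedInstants = new TreeSet<>(completed);
    this.inflightInstants = inflight;
  }

  public boolean isFileSliceCommitted(String baseInstantTime) {
    if (completedInstants.isEmpty()) {
      return false;
    }
    // Guard from the original code: the base instant must not be newer than
    // the last completed instant.
    if (baseInstantTime.compareTo(completedInstants.last()) > 0) {
      return false;
    }
    // Original check: the instant is in the completed timeline, or precedes
    // its start (which can legitimately happen after archiving).
    boolean containsOrBefore = completedInstants.contains(baseInstantTime)
        || baseInstantTime.compareTo(completedInstants.first()) < 0;
    // Proposed extra guard: reject instants that are still inflight, such as
    // app1's failed commit 00000001 in the scenario above.
    return containsOrBefore && !inflightInstants.contains(baseInstantTime);
  }

  public static void main(String[] args) {
    // app2's commit 00000002 completed; app1's 00000001 failed and is inflight.
    FileSliceCommitCheck check =
        new FileSliceCommitCheck(Set.of("00000002"), Set.of("00000001"));
    System.out.println(check.isFileSliceCommitted("00000002")); // true
    System.out.println(check.isFileSliceCommitted("00000001")); // false: inflight
  }
}
```

Without the inflight guard, `00000001` would pass the check merely because it sorts before the smallest completed commit, which is exactly the bug described above.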


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
