garyli1019 commented on a change in pull request #2929:
URL: https://github.com/apache/hudi/pull/2929#discussion_r629006340



##########
File path: 
hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
##########
@@ -142,6 +142,22 @@
    */
   private transient TotalSizeTracer tracer;
 
+  /**
+   * Flag saying whether the write task is waiting for the checkpoint success 
notification
+   * after it finished a checkpoint.
+   *
+   * <p>The flag is needed because the write task does not block during the 
waiting time interval,
+   * some data buckets still flush out with old instant time. There are two 
cases that the flush may produce
+   * corrupted files if the old instant is committed successfully:
+   * 1) the write handle was writing data but interrupted, left a corrupted 
parquet file;
+   * 2) the write handle finished the write but was not closed, left an empty 
parquet file.
+   *
+   * <p>To solve, when this flag was set to true, we flush the data buffer 
with a new instant time = old instant time + 1ms,
+   * the new instant time would affect the write file name. The filesystem 
view does not recognize the file as committed because
+   * it always filters the data files based on successful commit time.

Review comment:
       Do you mean we write the corrupted file with an invalid instant 
time(+1ms) then retry?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to