danny0405 commented on a change in pull request #2553:
URL: https://github.com/apache/hudi/pull/2553#discussion_r572538220



##########
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteFunction.java
##########
@@ -58,33 +61,39 @@
  * <p><h2>Work Flow</h2>
  *
  * <p>The function firstly buffers the data as a batch of {@link HoodieRecord}s,
- * It flushes(write) the records batch when a Flink checkpoint starts. After a batch has been written successfully,
+ * It flushes(write) the records batch when a batch exceeds the configured size {@link FlinkOptions#WRITE_BATCH_SIZE}
+ * or a Flink checkpoint starts. After a batch has been written successfully,
  * the function notifies its operator coordinator {@link StreamWriteOperatorCoordinator} to mark a successful write.
  *
- * <p><h2>Exactly-once Semantics</h2>
+ * <p><h2>The Semantics</h2>
  *
  * <p>The task implements exactly-once semantics by buffering the data between checkpoints. The operator coordinator

Review comment:
       We do not block now when a checkpoint starts and triggers the data write. That means that while the checkpoint's data buffer is being flushed, new data may come in and trigger another write, so a checkpoint may contain more data than it should.
   
   When a checkpoint fails, the job may roll back to a checkpoint that has already written extra data, and that data buffer is duplicated.
   
   But we still have eventually consistent semantics, based on the fact that every Hoodie record has a record key.
   
   The old pipeline did not have exactly-once semantics either. We just found that there is no need to keep it for the new pipeline, in order to improve throughput.
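   The buffer-and-flush behavior described above (flush when the batch exceeds a configured size, or when a checkpoint starts) can be sketched roughly as follows. This is a simplified, hypothetical illustration, not the actual `StreamWriteFunction` code: the names `BufferingWriter` and `flushHandler` are invented, and `batchSize` only stands in for `FlinkOptions#WRITE_BATCH_SIZE`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of size- or checkpoint-triggered batch flushing.
public class BufferingWriter<T> {
  private final int batchSize;                  // stand-in for FlinkOptions#WRITE_BATCH_SIZE
  private final Consumer<List<T>> flushHandler; // receives each flushed batch
  private final List<T> buffer = new ArrayList<>();

  public BufferingWriter(int batchSize, Consumer<List<T>> flushHandler) {
    this.batchSize = batchSize;
    this.flushHandler = flushHandler;
  }

  /** Buffers one record; flushes when the configured batch size is reached. */
  public void processElement(T record) {
    buffer.add(record);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  /**
   * Called when a checkpoint starts: flush whatever is buffered. As the review
   * notes, input is not blocked during the flush, so a checkpoint may cover
   * more data than it should; dedup relies on the Hoodie record key.
   */
  public void snapshotState() {
    if (!buffer.isEmpty()) {
      flush();
    }
  }

  private void flush() {
    flushHandler.accept(new ArrayList<>(buffer)); // hand off a copy of the batch
    buffer.clear();
  }
}
```

   A rollback to an older checkpoint then replays records that were already flushed; the upsert by record key is what keeps the table eventually consistent despite the duplicates.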




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]