Re: [PR] [HUDI-6129] Support rate limit for Spark streaming source [hudi]

via GitHub Thu, 14 Dec 2023 22:49:48 -0800


boneanxs commented on code in PR #10326:
URL: https://github.com/apache/hudi/pull/10326#discussion_r1427612669



##########
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java:
##########
@@ -246,6 +257,19 @@ public interface HoodieTimeline extends Serializable {
    */
   HoodieTimeline findInstantsInClosedRange(String startTs, String endTs);
 
+  /**
+   * Create a new Timeline with instants after or equals startTs and before or on endTs
+   * by completionTime.
+   */
+  HoodieTimeline findInstantsInClosedRangeByCompletionTime(String startTs, String endTs);
+

Review Comment:
   A rate limit at the `commit` level is effectively a limit at the write-operation
   level (one write operation equals one commit). However, this may not be
   sufficient, because the amount of incoming data per batch is then dictated by
   each upstream write operation. Let me provide two scenarios to illustrate this:
   
   1. Batch write -> HUDI -> streaming read. The batch job may have more
      resources than the streaming job, and it may `insert` some data or
      `insert_overwrite` across many partitions. The streaming job is at high risk
      of failing if we can only limit the data at the commit level, so we would
      have to allocate a large amount of resources to this long-running streaming
      job to avoid that.
   2. Streaming write -> HUDI -> different downstream streaming read jobs. Each
      downstream job might have different resource limits: some have sufficient
      resources to handle large commits, but others do not (they can tolerate
      slightly delayed data freshness). If we only limit at the commit level, every
      downstream job must have at least the same resources as the upstream
      streaming job.
   
   Is there any issue with reading only part of the files within a single commit?
   I'm considering this as a potential improvement that would offer users more
   choices.
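
To make the difference between the two granularities concrete, here is a minimal sketch in plain Java. It is not Hudi code: `RateLimitSketch`, `batchBytesByCommit`, `batchBytesByFile`, and the `maxBytes` parameter are hypothetical names used only for illustration, and each commit is modeled simply as a list of file sizes in bytes. It shows that a commit-level planner has to take an oversized commit whole, while a file-level planner can stop in the middle of a commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: none of these names come from Hudi. */
public class RateLimitSketch {

    /** Commit-level limit: every batch contains whole commits. */
    public static List<Long> batchBytesByCommit(List<List<Long>> commits, long maxBytes) {
        List<Long> batches = new ArrayList<>();
        long current = 0;
        for (List<Long> commitFiles : commits) {
            long commitBytes = 0;
            for (long size : commitFiles) {
                commitBytes += size;
            }
            // Close the running batch if adding this commit would overflow it.
            if (current > 0 && current + commitBytes > maxBytes) {
                batches.add(current);
                current = 0;
            }
            // A commit cannot be split, so it is taken whole even when it
            // exceeds maxBytes on its own.
            current += commitBytes;
        }
        if (current > 0) {
            batches.add(current);
        }
        return batches;
    }

    /** File-level limit: a batch may end in the middle of a commit. */
    public static List<Long> batchBytesByFile(List<List<Long>> commits, long maxBytes) {
        List<Long> batches = new ArrayList<>();
        long current = 0;
        for (List<Long> commitFiles : commits) {
            for (long size : commitFiles) {
                // Close the running batch if adding this file would overflow it.
                if (current > 0 && current + size > maxBytes) {
                    batches.add(current);
                    current = 0;
                }
                current += size;
            }
        }
        if (current > 0) {
            batches.add(current);
        }
        return batches;
    }
}
```

With three commits of file sizes `[100, 100]`, `[400, 400, 400]`, and `[50]` and `maxBytes = 500`, the commit-level planner produces a 1200-byte batch for the second commit, whereas the file-level planner keeps every batch at or below 500 bytes. The cost, under this sketch's assumptions, is that a reader would need to remember its position inside a commit (not just the commit time) between batches, which is the open question raised above.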



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

