yihua commented on code in PR #11947:
URL: https://github.com/apache/hudi/pull/11947#discussion_r1805754603


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java:
##########
@@ -118,6 +124,32 @@ public QueryInfo getNextCheckpoint(Dataset<Row> df, 
QueryInfo queryInfo, Option<
         .orElse(queryInfo);
   }
 
+  public Option<CheckpointWithPredicates> getNextCheckpoint(Dataset<Row> df, 
QueryContext queryContext,
+                                                            
Option<SourceProfileSupplier> sourceProfileSupplier) {
+    // the start instant would be included into the final query result. So we 
need to get
+    // a strictly lower timestamp to have query splitter include the start 
instant
+    Option<CheckpointWithPredicates> nextCheckpointWithPredicates =
+        getNextCheckpointWithPredicates(df, 
instantTimeMinusMillis(queryContext.getBeginInstant().get(), 1));
+    if (nextCheckpointWithPredicates.isPresent()) {
+      // getNextCheckpointWithPredicates is based on instant times,
+      // so we need to translate the instant time to the completion time
+      String endInstantTime = 
nextCheckpointWithPredicates.get().getEndInstant();
+      Option<String> endCompletionTime = 
Option.fromJavaOptional(queryContext.getInstants().stream()
+          .filter(instant -> endInstantTime.equals(instant.getTimestamp()))

Review Comment:
   The end completion time or next checkpoint returned by the custom 
`SnapshotLoadQuerySplitter` implementation can be earlier than the max 
completion time from query context.  For example, the query context returns 
these instants with instant and completion time, `[(t10, t20), (t30, t40), 
(t50, t60)]` with the max completion time of `t60`, the custom 
`SnapshotLoadQuerySplitter` implementation may take and process this as input 
and only return `t30` as the next instant to stop (thus `t40` as the 
`endCompletionTime`) after custom logic of further limiting the instants to 
read.  So we need to account for such a logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to