codope commented on code in PR #10218:
URL: https://github.com/apache/hudi/pull/10218#discussion_r1410419696
##########
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##########
@@ -159,6 +166,50 @@ public Option<String> getCompletionTime(String startTime) {
// the instant is still pending
return Option.empty();
}
+ loadCompletionTimeIncrementally(startTime);
+ return Option.ofNullable(this.startToCompletionInstantTimeMap.get(startTime));
+ }
+
+ /**
+ * Queries the instant start time with given completion time range.
+ *
+ * <p>By default, assumes there is at most 1 day time of duration for an instant to accelerate the queries.
+ *
+ * @param startCompletionTime The start completion time.
+ * @param endCompletionTime The end completion time.
+ *
+ * @return The instant time set.
+ */
+ public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime) {
+ // assumes any instant/transaction lasts at most 1 day to optimize the query efficiency.
Review Comment:
This is typically a safe assumption, since most commits last from a few
minutes to an hour or so. But in some cases, when a pipeline is blocked, a
commit can remain pending for a longer duration. Should this be an internal
config?
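To make the suggestion concrete, here is a minimal sketch of how the look-back window could be supplied by the caller instead of being fixed at one day. It relies on the three-argument `getStartTimeSet` overload from this diff accepting a `Function<String, String>`. The class name `EarliestStartTimeSketch`, its methods, and the assumption that instant times use the `yyyyMMddHHmmssSSS` layout are illustrative only and are not part of the PR; a real change would presumably keep using `HoodieInstantTimeGenerator.instantTimeMinusMillis`.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.function.Function;

// Hypothetical helper, not part of Hudi: builds the "earliest start time"
// function from a configurable maximum instant duration.
final class EarliestStartTimeSketch {
  // Assumption: instant times are 17 digits, yyyyMMddHHmmss followed by 3 millisecond digits.
  private static final DateTimeFormatter SECONDS = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  private EarliestStartTimeSketch() {
  }

  // Subtracts the given number of milliseconds from an instant time string.
  static String minusMillis(String instantTime, long millis) {
    LocalDateTime base = LocalDateTime.parse(instantTime.substring(0, 14), SECONDS);
    int ms = Integer.parseInt(instantTime.substring(14, 17));
    LocalDateTime shifted = base.plusNanos(ms * 1_000_000L).minus(Duration.ofMillis(millis));
    return shifted.format(SECONDS) + String.format("%03d", shifted.getNano() / 1_000_000);
  }

  // Returns a function mapping a completion time to the earliest start time to scan from.
  static Function<String, String> earliestStartTimeFunc(Duration maxInstantDuration) {
    long millis = maxInstantDuration.toMillis();
    return completionTime -> minusMillis(completionTime, millis);
  }
}
```

A caller could then pass `EarliestStartTimeSketch.earliestStartTimeFunc(Duration.ofDays(3))` as the third argument when a pipeline is known to stall for longer than a day, with the duration read from whatever internal config is chosen.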
##########
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##########
@@ -159,6 +166,50 @@ public Option<String> getCompletionTime(String startTime) {
// the instant is still pending
return Option.empty();
}
+ loadCompletionTimeIncrementally(startTime);
+ return Option.ofNullable(this.startToCompletionInstantTimeMap.get(startTime));
+ }
+
+ /**
+ * Queries the instant start time with given completion time range.
+ *
+ * <p>By default, assumes there is at most 1 day time of duration for an instant to accelerate the queries.
+ *
+ * @param startCompletionTime The start completion time.
+ * @param endCompletionTime The end completion time.
+ *
+ * @return The instant time set.
+ */
+ public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime) {
+ // assumes any instant/transaction lasts at most 1 day to optimize the query efficiency.
+ return getStartTimeSet(startCompletionTime, endCompletionTime, s -> HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+ }
+
+ /**
+ * Queries the instant start time with given completion time range.
+ *
+ * @param startCompletionTime The start completion time.
+ * @param endCompletionTime The end completion time.
+ *
+ * @return The instant time set.
+ */
+ public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime, Function<String, String> earliestStartTimeFunc) {
+ String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+ final InstantRange instantRange = InstantRange.builder()
+ .rangeType(InstantRange.RangeType.CLOSE_CLOSE)
Review Comment:
Not related to this PR, but naming the `RangeType` values `OPEN`, `CLOSED`,
`LEFT_OPEN`, and `RIGHT_OPEN` sounds more canonical. If you agree, feel free
to open a separate PR.
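For illustration, a minimal sketch of what the suggested naming could look like, with each constant's interval semantics spelled out. `RangeTypeSketch` and its `contains` method are hypothetical and do not reflect Hudi's actual `InstantRange` API; the existing `CLOSE_CLOSE` would correspond to `CLOSED` here. Instant times are compared as plain strings, which assumes fixed-width timestamps.

```java
// Hypothetical enum, not Hudi's InstantRange.RangeType: shows the proposed names
// and the interval each one stands for.
enum RangeTypeSketch {
  OPEN,        // (start, end)
  CLOSED,      // [start, end]
  LEFT_OPEN,   // (start, end]
  RIGHT_OPEN;  // [start, end)

  // Returns true if the instant falls inside the range under this range type.
  boolean contains(String start, String end, String instant) {
    int lo = instant.compareTo(start);
    int hi = instant.compareTo(end);
    boolean leftOk = (this == CLOSED || this == RIGHT_OPEN) ? lo >= 0 : lo > 0;
    boolean rightOk = (this == CLOSED || this == LEFT_OPEN) ? hi <= 0 : hi < 0;
    return leftOk && rightOk;
  }
}
```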
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/execution/benchmark/LSMTimelineReadBenchmark.scala:
##########
@@ -42,8 +42,9 @@ object LSMTimelineReadBenchmark extends HoodieBenchmarkBase {
 * Apple M2
 * pref load archived instants:          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 * ------------------------------------------------------------------------------------------------------------------------
- * read shim instants                              18             32          15         0.1       17914.8       1.0X
- * read instants with commit metadata              19             25           5         0.1       19403.1       0.9X
+ * read slim instants                             494            521          27         0.5        1899.6       1.0X
+ * read instants with commit metadata            2544           2625         116         0.1        9785.9       0.2X
+ * read start time                                156            177          26         1.7         601.1       3.2X
Review Comment:
Just for my understanding, why is reading the timeline with start time 3.2x
slower? I expected it to be a little slower, but 3.2x sounds like a big
difference.
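As a sanity check on how to read the `Relative` column, here is a small sketch of the arithmetic. It assumes the column is the baseline row's best time divided by each row's best time, which is the usual convention for Spark-style benchmark output but is not confirmed by this diff. Under that assumption, a value above 1.0X would mean the case ran faster than the baseline, not slower. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: recomputes the Relative column from the Best Time(ms) column,
// assuming relative = baseline best time / case best time.
final class RelativeSpeedSketch {
  private RelativeSpeedSketch() {
  }

  // Ratio of the baseline's best time to this case's best time.
  static double relative(double baselineBestMs, double caseBestMs) {
    return baselineBestMs / caseBestMs;
  }

  // Formats the ratio the way the benchmark table prints it, e.g. "3.2X".
  static String format(double ratio) {
    return String.format(Locale.ROOT, "%.1fX", ratio);
  }
}
```

With the numbers from the updated table, `format(relative(494, 156))` gives `3.2X` for `read start time` and `format(relative(494, 2544))` gives `0.2X` for `read instants with commit metadata`, which matches the printed values under the stated assumption.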