[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

ASF GitHub Bot (Jira) Wed, 20 Apr 2022 05:04:11 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759150&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759150
 ]


ASF GitHub Bot logged work on HIVE-26151:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Apr/22 12:03
            Start Date: 20/Apr/22 12:03
    Worklog Time Spent: 10m 
      Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r854053748


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##########
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) {
   public static boolean isBucketed(Table table) {
     return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we return an empty optional.
+   * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+    if (table.history().get(0).timestampMillis() > time) {
+      return Optional.empty();
+    }
+
+    for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   Looks like the snapshots are ordered by commit time.
   Whenever there's a commit, we take the existing list of snapshots [here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L817) and simply append the new snapshot to the end [here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L994).
   And since it's a List, the iteration order is deterministic, so the loop should see the snapshots in commit order.
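
   To illustrate why the ordering matters for this lookup, here is a minimal, self-contained sketch. It is not the PR's code (the diff above is cut off at the loop) and it does not use the Iceberg API: `SnapshotLookup` and `Snap` are hypothetical stand-ins that carry only an ID and a commit timestamp. It assumes the list is in commit order, as described above, and stops at the first snapshot committed after the requested time.

```java
import java.util.List;
import java.util.Optional;

final class SnapshotLookup {

  // Hypothetical stand-in for a snapshot: just an ID and a commit time in millis.
  static final class Snap {
    final long id;
    final long timestampMillis;

    Snap(long id, long timestampMillis) {
      this.id = id;
      this.timestampMillis = timestampMillis;
    }
  }

  // Returns the ID of the last snapshot committed at or before `time`,
  // or an empty Optional if `time` precedes the first snapshot.
  // Assumes `snapshots` is ordered by commit time (oldest first).
  static Optional<Long> findSnapshotForTimestamp(List<Snap> snapshots, long time) {
    Optional<Long> result = Optional.empty();
    for (Snap snapshot : snapshots) {
      if (snapshot.timestampMillis > time) {
        // Every later entry is newer still, so we can stop here.
        break;
      }
      result = Optional.of(snapshot.id);
    }
    return result;
  }
}
```

   With snapshots 1, 2, 3 committed at times 3, 6, 9 (mirroring the Javadoc example), a lookup at time 2 yields an empty Optional, times 3 to 5 yield snapshot 1, times 6 to 8 yield snapshot 2, and anything from 9 onwards yields snapshot 3. If the list were not in commit order, the early `break` would no longer be safe and the loop would have to track the maximum qualifying timestamp instead.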





Issue Time Tracking
-------------------

    Worklog Id:     (was: 759150)
    Time Spent: 1h 10m  (was: 1h)

> Support range-based time travel queries for Iceberg
> ---------------------------------------------------
>
>                 Key: HIVE-26151
>                 URL: https://issues.apache.org/jira/browse/HIVE-26151
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Marton Bod
>            Assignee: Marton Bod
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<startTime>' TO '<endTime>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <startVersion> TO <endVersion>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
