[GitHub] [iceberg] kbendick commented on a change in pull request #3039: Introduce spark3 option to read stream from a timestamp

GitBox Wed, 01 Sep 2021 10:55:00 -0700


kbendick commented on a change in pull request #3039:
URL: https://github.com/apache/iceberg/pull/3039#discussion_r700440029




##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -65,12 +65,16 @@ public static boolean ancestorOf(Table table, long snapshotId, long ancestorSnap
   }
 
   /**
-   * Traverses the history of the table's current snapshot and finds the oldest Snapshot.
-   * @return null if there is no current snapshot in the table, else the oldest Snapshot.
+   * Traverses the history of the table's current snapshot and finds the oldest Snapshot after the timestamp.
+   * @return null if there is no current snapshot in the table, else the oldest Snapshot after the timestamp.

Review comment:
       Ok. That does help clarify for me @zhengzengprc
   
   I think I was expecting it to work slightly differently. What is the exact 
intent / use case for grabbing the snapshot whose timestamp is strictly greater 
than the passed-in timestamp?
   
   And for your present use case, what is supposed to happen if a user passes 
in a timestamp that matches a snapshot exactly?
   
   Let's say we have snapshots with timestamps 10, 20, 30, 40, 50.
   The current snapshot has timestamp 50.
   The input timestamp is 30.
   Will oldestSnapshot still return the snapshot with timestamp 40, or will the 
user get the one with timestamp 30?
   
   For your proposed functionality of grabbing the snapshot just after the 
timestamp passed in, I'm guessing it would still return 40 (as that's also the 
earliest snapshot after 31).
   
   These are my two biggest outstanding questions. I'm mainly wondering what 
the motivation is for the feature working the way it does.
   
   Some of the engines (or batch reads in Spark) support `as-of-timestamp`, 
which I believe should definitely grab 30 in the case I mentioned, as that's 
the exact timestamp. I believe it would _also_ grab 30 in your example of 
passing in 31 as the input timestamp, as the table's state as of 31 is what it 
was at timestamp 30.
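   
   To make the difference concrete, here is a minimal sketch of the two 
semantics as I understand them, run over the example history 10, 20, 30, 40, 
50. The class and method names (`SnapshotPick`, `firstAfter`, `asOf`) are 
hypothetical and operate on plain timestamps; this is not the Iceberg API or 
the code in this PR.

```java
import java.util.OptionalLong;

public class SnapshotPick {
  // Hypothetical helper: earliest timestamp strictly greater than the input
  // (the "oldest snapshot after the timestamp" behavior discussed above).
  // Assumes the timestamps are sorted in ascending order.
  static OptionalLong firstAfter(long[] sortedTimestamps, long input) {
    for (long ts : sortedTimestamps) {
      if (ts > input) {
        return OptionalLong.of(ts);
      }
    }
    return OptionalLong.empty();
  }

  // Hypothetical helper: latest timestamp less than or equal to the input
  // (my understanding of as-of-timestamp semantics).
  static OptionalLong asOf(long[] sortedTimestamps, long input) {
    OptionalLong result = OptionalLong.empty();
    for (long ts : sortedTimestamps) {
      if (ts > input) {
        break;
      }
      result = OptionalLong.of(ts);
    }
    return result;
  }

  public static void main(String[] args) {
    long[] history = {10, 20, 30, 40, 50};
    for (long input : new long[] {30, 31}) {
      System.out.println("input=" + input
          + " firstAfter=" + firstAfter(history, input)
          + " asOf=" + asOf(history, input));
    }
  }
}
```

   Under these assumptions, inputs 30 and 31 both give 40 for the 
strictly-after variant and 30 for the as-of variant.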
   
   **Question**: What's the use case / motivation for grabbing the timestamp 
just _after_ the one passed in, and not grabbing the table as of the timestamp 
that is given (or starting from the snapshot that was the most recent as of the 
timestamp - so grabbing 30 in your example of input 31 instead of grabbing 40)? 
What problem does this solve that requires it to differ from the semantics of 
`as-of-timestamp`? And could the problem that you're looking to solve be solved 
by `as-of-timestamp`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


