kbendick commented on a change in pull request #3039:
URL: https://github.com/apache/iceberg/pull/3039#discussion_r700440029
##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -65,12 +65,16 @@ public static boolean ancestorOf(Table table, long
snapshotId, long ancestorSnap
}
/**
- * Traverses the history of the table's current snapshot and finds the
oldest Snapshot.
- * @return null if there is no current snapshot in the table, else the
oldest Snapshot.
+ * Traverses the history of the table's current snapshot and finds the
oldest Snapshot after the timestamp.
+ * @return null if there is no current snapshot in the table, else the
oldest Snapshot after the timestamp.
Review comment:
Ok. That does help clarify for me @zhengzengprc
I think I was expecting it to work slightly differently. I'd like to clarify
what is the exact intent / use case for choosing to grab the snapshot that is
strictly greater than the passed in timestamp?
And for your present use case, what is supposed to happen if a user passes
in a timestamp that exists exactly?
let's say we have snapshot with time series: 10,20,30,40,50
Current snapshot will be with timestamp 50.
The input timestamp is 30.
Will the oldestSnapshot still return with timestamp 40, or will the user get
timestamp 30?
For your proposed functionality of grabbing the timestamp just after that
passed in, I'm guessing it would return 4 still (as that's the newest snapshot
just after 31)
These are my two biggest outstanding questions / wondering why the
motivation for the feature the way that it is.
Some of the engines (or batch for spark) support `as-of-timestamp`, which I
believe should definitely grab 30 in the case I mentioned as that's the exact
timestamp. I believe it would _also_ grab 30 in your example of passing in 31
as the input timestamp, as the table's state as of 31 is what it was at
timestamp 30.
**Question**: What's the use case / motivation for grabbing the timestamp
just _after_ the one passed in, and not grabbing the table as of the timestamp
that is given (or starting from the snapshot that was the most recent as of the
timestamp - so grabbing 30 in your example of input 31 instead of grabbing 40)?
What problem does this solve that requires it to differ from the semantics of
`as-of-timestamp`? And could the problem that you're looking to solve be solved
by `as-of-timestamp`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]