zhengzengprc commented on a change in pull request #3039:
URL: https://github.com/apache/iceberg/pull/3039#discussion_r700647474
##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java
##########
@@ -53,4 +53,7 @@ private SparkReadOptions() {
// skip snapshots of type delete while reading stream out of iceberg table
public static final String STREAMING_SKIP_DELETE_SNAPSHOTS =
"streaming-skip-delete-snapshots";
+
+ // Timestamp in milliseconds; start a stream from the snapshot that occurs
after this timestamp
+ public static final String STREAM_FROM_TIMESTAMP = "stream-from-timestamp";
Review comment:
> A few nits / questions here:
>
> 1. Are there any concerns with the users timezone or anything? I'm still
wrapping my head around all of the code for streaming as source with Spark, so
forgive me if this is covered already. But the raw timestamp I'm assuming is
supposed to be in millis from the epoch in UTC or can users configure their
setup so that table snapshots have non-UTC timestamps? I've always used UTC so
this might not actually be a concern.
> 2. Looking at the `oldestSnapshot` code, it seems like it's grabbing at
the first snapshot it can find that is older than the timestamp passed in. Is
the phrase `from the snapshot that occurs after this timestamp` correct or
possibly a little ambiguous? I might just need more coffee, but I would have
expected `after this timestamp` to mean we grab the first snapshot that
occurred later than this value.
> 3. This might be something to handle in another PR, but should we consider
allowing for timestamps as something other than strings representing timestamps
in millis? Like either as timestamp formatted strings, and not just millis, or
even as `NOW() - INTERVAL '5 days'` kind of thing, which would likely need to
be added as a spark resolution rule.
1. Yes, the timestamp is supposed to be in millis from the epoch in UTC. I
am not aware that if use could configure their non-UTC timestamps. if we wanna
support this scenario. I think either we need a function to convert the user
timestamp into UTC timestamp, or we need to pass in the user timezone
information then converting it later. But I don't think we support other
timezone other than UTC today.
2. IMO, AS_OF_TIMESTAMP is ">=", STREAM_FROM_TIMESTAMP is ">". But as the
discussion above, I don't find a strong reason that STREAM_FROM_TIMESTAMP
should be strictly greater. This I need @SreeramGarlapati to confirm, if
STREAM_FROM_TIMESTAMP is NOT strictly greater, I think we can use the current
option AS_OF_TIMESTAMP. @kbendick Your understand is totally right we grab the
first snapshot that occurred later than this value, but I am not sure it should
be inclusive the start timestamp or not.
3. We can consider this some future enhancement in future, and this function
currently is not in our radar. But I would say it is a very good suggestion.
That we need another layer to translate NOW() - INTERVAL '5 days' to UTC
timestamp
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]