zhengzengprc commented on a change in pull request #3039:
URL: https://github.com/apache/iceberg/pull/3039#discussion_r700647474



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java
##########
@@ -53,4 +53,7 @@ private SparkReadOptions() {
 
   // skip snapshots of type delete while reading stream out of iceberg table
   public static final String STREAMING_SKIP_DELETE_SNAPSHOTS = 
"streaming-skip-delete-snapshots";
+
+  // Timestamp in milliseconds; start a stream from the snapshot that occurs 
after this timestamp
+  public static final String STREAM_FROM_TIMESTAMP = "stream-from-timestamp";

Review comment:
       > A few nits / questions here:
   > 
   > 1. Are there any concerns with the users timezone or anything? I'm still 
wrapping my head around all of the code for streaming as source with Spark, so 
forgive me if this is covered already. But the raw timestamp I'm assuming is 
supposed to be in millis from the epoch in UTC or can users configure their 
setup so that table snapshots have non-UTC timestamps? I've always used UTC so 
this might not actually be a concern.
   > 2. Looking at the `oldestSnapshot` code, it seems like it's grabbing at 
the first snapshot it can find that is older than the timestamp passed in. Is 
the phrase `from the snapshot that occurs after this timestamp` correct or 
possibly a little ambiguous? I might just need more coffee, but I would have 
expected `after this timestamp` to mean we grab the first snapshot that 
occurred later than this value.
   > 3. This might be something to handle in another PR, but should we consider 
allowing for timestamps as something other than strings representing timestamps 
in millis? Like either as timestamp formatted strings, and not just millis, or 
even as `NOW() - INTERVAL '5 days'` kind of thing, which would likely need to 
be added as a spark resolution rule.
   
   1. Yes, the timestamp is supposed to be in millis from the epoch in UTC. I 
am not aware that if user could configure their non-UTC timestamps. if we wanna 
support this scenario with user timezone. I think either we need a function to 
convert the user timestamp into UTC timestamp, or we need to pass in the user 
timezone information then converting it later. But I don't think we support 
other timezones other than UTC today.
   2. IMO, AS_OF_TIMESTAMP is ">=",  STREAM_FROM_TIMESTAMP is ">". But as the 
discussion above, I don't find a strong reason that STREAM_FROM_TIMESTAMP 
should be strictly greater. This I need @SreeramGarlapati to confirm, if 
STREAM_FROM_TIMESTAMP is NOT strictly greater, I think we can use the current 
option AS_OF_TIMESTAMP. @kbendick Your understand is totally right we grab the 
first snapshot that occurred later than this value, but I am not sure it should 
be inclusive the start timestamp or not.
   3. We can consider this some future enhancement in future, and this function 
currently is not in our radar. But I would say it is a very good suggestion. 
That we need another layer to translate NOW() - INTERVAL '5 days' to UTC 
timestamp




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to