michalheld commented on code in PR #15193:
URL: https://github.com/apache/iceberg/pull/15193#discussion_r2905697077
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java:
##########
@@ -95,7 +95,12 @@ public Long snapshotId() {
}
public Long asOfTimestamp() {
- return
confParser.longConf().option(SparkReadOptions.AS_OF_TIMESTAMP).parseOptional();
+ SparkConfParser.LongConfParser longParser =
+ confParser.longConf().option(SparkReadOptions.AS_OF_TIMESTAMP);
+ if (snapshotId() == null && branch == null && tag() == null) {
+ longParser.sessionConf(SparkSQLProperties.AS_OF_TIMESTAMP);
+ }
+ return longParser.parseOptional();
Review Comment:
The idea was that explicit tag/branch/snapshot/as-of-timestamp — whether as
a table property, table ref, or in SQL — should take precedence over the
session-level time-travel property, which would be silently ignored. Examples
of this can be found in
org.apache.iceberg.spark.sql.TestSelect#testSessionPropertyWithMultiTableJoin.
Other test cases also show that if a specific tag or branch is defined for a
table, the session property is ignored and data is returned from the tag or
branch snapshot. This seems like reasonable behaviour to me, but I'm open to
other opinions.
I tried removing these conditions to understand how behaviour would change.
It turns out this method currently always returns null, because as-of-timestamp
is always pre-resolved to snapshot-id and stripped from the options before
SparkReadConf is constructed. Removing the guard causes a conflict between
snapshot-id and as-of-timestamp whenever snapshot, tag, branch, or
as-of-timestamp are specified at the table level — either in SQL or via the
DataFrame API.
This leaves us with three options:
1. Keep the current conditions and document that the session property is
only applied when no table-level time-travel is set, and silently ignored
otherwise.
2. Change the implementation so that a table-level as-of-timestamp takes
precedence over the session property, while still failing if the session
property is combined with a branch or tag reference. This would require a much
more complex implementation, moving the resolution logic into the places where
the as-of-timestamp is currently pre-resolved to snapshot-id.
3. Fail the query when the session property is set, and the query also
specifies a table-level snapshot, as-of-timestamp, branch, or tag. However,
removing the conditions as-is would always produce the error "Cannot set both
snapshot-id and as-of-timestamp to select which table snapshot to scan", since
all table-level time-travel options pre-resolve to snapshot-id, which can be a
confusing message in this context.
I personally prefer option 1.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]