[GitHub] [iceberg] rdblue commented on a change in pull request #1508: Use schema at the time of the snapshot when reading a snapshot.

GitBox Fri, 01 Oct 2021 16:46:59 -0700


rdblue commented on a change in pull request #1508:
URL: https://github.com/apache/iceberg/pull/1508#discussion_r720590375




##########
File path: 
spark3/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java
##########
@@ -101,24 +103,35 @@ public Table getTable(StructType schema, Transform[] 
partitioning, Map<String, S
     SparkSession spark = SparkSession.active();
     setupDefaultSparkCatalog(spark);
     String path = options.get("path");
+    Long snapshotId = Spark3Util.propertyAsLong(options, 
SparkReadOptions.SNAPSHOT_ID, null);
+    Long asOfTimestamp = Spark3Util.propertyAsLong(options, 
SparkReadOptions.AS_OF_TIMESTAMP, null);

Review comment:
       Great, thanks for looking into it more! I've been considering other 
options to solve this problem and I think we probably will need to go with 
something similar to what you have here.
   
   The output attributes problem that you found makes sense. If there are 
columns in the snapshot's schema that have been dropped, then Spark doesn't 
have attributes for those columns and can't project them. We could try to 
expose the deleted columns as metadata columns in 3.1 and later, but we would 
still have a problem with `*` expansion because that uses the initial set of 
columns reported by the table and not metadata columns. So `select *` wouldn't 
behave the same way when time traveling if we took the approach with metadata 
columns.
   
   We could also try to solve this on the Spark side, but I don't think that 
it's unreasonable for Spark to do what it is here and expect that the table 
reports a schema initially that can be used for query analysis.
   
   The option that's left is loading the correct snapshot of the table so I 
agree with this direction. I'm weighing whether this should be done like 
metadata tables in the Iceberg layer or above it in Spark like you have here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a change in pull request #1508: Use schema at the time of the snapshot when reading a snapshot.

Reply via email to