[GitHub] [iceberg] wypoon commented on a change in pull request #1508: Use schema at the time of the snapshot when reading a snapshot.

GitBox Mon, 27 Sep 2021 21:33:34 -0700


wypoon commented on a change in pull request #1508:
URL: https://github.com/apache/iceberg/pull/1508#discussion_r717215294




##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java
##########
@@ -101,24 +103,35 @@ public Table getTable(StructType schema, Transform[] partitioning, Map<String, S
     SparkSession spark = SparkSession.active();
     setupDefaultSparkCatalog(spark);
     String path = options.get("path");
+    Long snapshotId = Spark3Util.propertyAsLong(options, SparkReadOptions.SNAPSHOT_ID, null);
+    Long asOfTimestamp = Spark3Util.propertyAsLong(options, SparkReadOptions.AS_OF_TIMESTAMP, null);

Review comment:
   It turns out that you are mistaken.
   The [`DataSourceV2Relation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L131) is [created](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L175-L176) with the output attributes from the schema given by `SparkTable#schema()`.
   The `SparkTable` is loaded using the catalog and identifier, which is why I need the `SnapshotAwareIdentifier` when loading it: it lets me return the snapshot schema from `SparkTable#schema()`.
   Without it, the `SparkScanBuilder` still has the `snapshot-id` or `as-of-timestamp` option, but Spark calls its `pruneColumns()` with a `requestedSchema` that is a subset of the table schema, so its `build()` returns a `SparkBatchQueryScan` with an incorrect schema.
   When I remove the modifications to `IcebergSource`, `SparkCatalog` and `SparkTable`, the unit tests I added all fail, as I suspected.
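
To make the ordering problem concrete, here is a toy model of it in plain Java. It is only a sketch: `ToyTable` and `requestedSchema` are hypothetical stand-ins and not Spark or Iceberg classes, schemas are reduced to lists of column names, and the column names are made up. It assumes, as described above, that the engine derives the requested schema from whatever the loaded table reports in `schema()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are Spark or Iceberg classes.
public class SnapshotSchemaSketch {

  // Hypothetical stand-in for a loaded table; schema() plays the role of SparkTable#schema().
  public static final class ToyTable {
    private final List<String> currentSchema;
    private final List<String> snapshotSchema; // null when no snapshot was known at load time

    public ToyTable(List<String> currentSchema, List<String> snapshotSchema) {
      this.currentSchema = currentSchema;
      this.snapshotSchema = snapshotSchema;
    }

    // Reports the snapshot schema only if the snapshot was known when the table was loaded.
    public List<String> schema() {
      return snapshotSchema != null ? snapshotSchema : currentSchema;
    }
  }

  // Stand-in for what the engine passes to pruneColumns(): a subset of table.schema().
  public static List<String> requestedSchema(ToyTable table, List<String> wanted) {
    List<String> requested = new ArrayList<>();
    for (String column : table.schema()) {
      if (wanted.contains(column)) {
        requested.add(column);
      }
    }
    return requested;
  }

  public static void main(String[] args) {
    List<String> current = Arrays.asList("id", "data", "added_later");
    List<String> atSnapshot = Arrays.asList("id", "data");
    List<String> wanted = Arrays.asList("id", "added_later");

    // Snapshot known only through scan options: the table still reports the current schema,
    // so the requested schema can contain a column the snapshot never had.
    ToyTable unaware = new ToyTable(current, null);
    System.out.println(requestedSchema(unaware, wanted)); // [id, added_later]

    // Snapshot known at load time: the requested schema stays within the snapshot schema.
    ToyTable aware = new ToyTable(current, atSnapshot);
    System.out.println(requestedSchema(aware, wanted)); // [id]
  }
}
```

In this model the scan builder never gets a chance to correct the schema, because the requested columns are already derived from the table's reported schema before any scan option is consulted.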



