rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r680278607



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,29 @@ protected Schema tableSchema() {
   }
 
   private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema readSchema) {
-    StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+    Schema taskSchema = task.schema() == null ? tableSchema : task.schema();
+    StructInternalRow row = new StructInternalRow(taskSchema.asStruct());

Review comment:
       I think I see what's going on. For the entries table, Spark will push 
the projection into the scan and because we are reading manifests as the data 
files, we actually apply that projection when reading in the data task (the 
`data_file` schema is passed into each `ManifestReadTask`).
   
   In theory, we should be able to use `expectedSchema` here instead of 
`tableSchema` to handle this because the expected schema should match the 
schema that gets pushed down by Spark. But in practice there are two problems:
   1. `ManifestReader` will only accept a file projection because it needs to 
return live entries, so it always projects all fields of `manifest_entry`.
   2. Some tables use this Spark projection to avoid needing to project rows in 
Iceberg. For example, rows in the history table are never projected because we 
didn't want to implement a projection in Iceberg when it was built.
   
   I see how this is a reasonable work-around, but I think we should fix some 
of the debt instead of moving ahead with it. We should make sure that tasks 
produce the `expectedSchema` instead of trying to figure out what schema the 
task produces.
   
   I would solve this by using `StructProjection` to project rows in tables 
like the history table that return full rows. And I would also use it to prune 
out the additional top-level fields of `manifest_entry`. I think if you do 
that, then there will be no need to add a task-specific schema. *And*, we 
should be able to remove the Spark projection here, which exists because of the 
history table setup. Now that we have an Iceberg projection there is no need to 
continue doing that.
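   
   To make the suggestion concrete, here is a minimal, self-contained sketch of what such a positional struct projection does: it maps projected field positions back to source positions and redirects reads through that mapping, so a full row can be served through a pruned view without copying. This is an illustrative stand-in, not Iceberg's actual `StructProjection` API; the class and field names below are hypothetical.
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   
   public class ProjectionSketch {
     // Simplified, hypothetical analogue of a struct projection: resolves each
     // projected field name to its position in the source schema once, then
     // reads through that mapping for every wrapped row.
     static final class SimpleProjection {
       private final int[] positions;  // projected position -> source position
       private Object[] wrapped;
   
       SimpleProjection(List<String> sourceFields, List<String> projectedFields) {
         this.positions = projectedFields.stream()
             .mapToInt(sourceFields::indexOf)
             .toArray();
       }
   
       SimpleProjection wrap(Object[] row) {
         this.wrapped = row;  // no copy: reads are redirected via positions
         return this;
       }
   
       Object get(int projectedPos) {
         return wrapped[positions[projectedPos]];
       }
   
       int size() {
         return positions.length;
       }
     }
   
     public static void main(String[] args) {
       // A manifest_entry-like row: (status, snapshot_id, data_file)
       List<String> source = Arrays.asList("status", "snapshot_id", "data_file");
       // The reader only asked for data_file, so the other top-level
       // fields are pruned by the projection rather than by the task.
       List<String> projected = Arrays.asList("data_file");
   
       SimpleProjection projection = new SimpleProjection(source, projected);
       Object[] row = {1, 42L, "file:/data/00000.parquet"};
   
       SimpleProjection view = projection.wrap(row);
       for (int i = 0; i < view.size(); i++) {
         System.out.println(view.get(i));  // only the projected field is visible
       }
     }
   }
   ```
   
   The same projection instance can be reused across rows via `wrap`, which is why a positional mapping computed once per schema is cheap enough to apply to full-row tables like the history table.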
   
   Does that make sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


