rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r680278607
##########
File path:
spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,29 @@ protected Schema tableSchema() {
}
private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema
readSchema) {
- StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+ Schema taskSchema = task.schema() == null ? tableSchema : task.schema();
+ StructInternalRow row = new StructInternalRow(taskSchema.asStruct());
Review comment:
I think I see what's going on. For the entries table, Spark will push
the projection into the scan and because we are reading manifests as the data
files, we actually apply that projection when reading in the data task (the
`data_file` schema is passed into each `ManifestReadTask`).
In theory, we should be able to use `expectedSchema` here instead of
`tableSchema` to handle this because the expected schema should match the
schema that gets pushed down by Spark. But in practice there are two problems:
1. `ManifestReader` will only accept a file projection because it needs to
return live entries, so it always projects all fields of `manifest_entry`
2. Some tables use this Spark projection to avoid needing to project rows in
Iceberg. For example, rows in the history table are never projected because we
didn't want to implement a projection in Iceberg when it was built.
I see how this is a reasonable work-around, but I think we should fix some
of the debt instead of moving ahead with it. We should make sure that tasks
produce the `expectedSchema` instead of trying to figure out what schema the
task produces.
I would solve this by using `StructProjection` to project rows in tables
like the history table that return full rows. And I would also use it to prune
out the additional top-level fields of `manifest_entry`. I think if you do
that, then there will be no need to add a task-specific schema. *And*, we
should be able to remove the Spark projection here, which exists because of the
history table setup. Once we have an Iceberg projection, there is no need to
continue doing that.
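To make the idea concrete, here is a minimal, self-contained sketch of the positional projection that something like `StructProjection` performs (toy Java with plain arrays, not Iceberg's actual classes or API): given the positions of the expected-schema fields within the full row, it returns only those fields. The history table could then keep returning full rows and still produce `expectedSchema`, and the extra top-level fields of `manifest_entry` could be pruned the same way.

```java
import java.util.Arrays;

// Toy illustration of positional struct projection (not Iceberg code):
// a projection is just the positions of the kept fields in the source row.
class StructProjectionSketch {
    private final int[] positions; // source-row positions of the projected fields

    StructProjectionSketch(int[] positions) {
        this.positions = positions;
    }

    // Return a new row containing only the projected fields, in projection order.
    Object[] wrap(Object[] sourceRow) {
        Object[] projected = new Object[positions.length];
        for (int i = 0; i < positions.length; i++) {
            projected[i] = sourceRow[positions[i]];
        }
        return projected;
    }

    public static void main(String[] args) {
        // Full row shaped loosely like a manifest entry: (status, snapshot_id, data_file).
        Object[] fullRow = {"ADDED", 42L, "file-a.parquet"};
        // Expected schema keeps only (snapshot_id, data_file).
        StructProjectionSketch projection = new StructProjectionSketch(new int[] {1, 2});
        System.out.println(Arrays.toString(projection.wrap(fullRow))); // [42, file-a.parquet]
    }
}
```

The point is that the task can always materialize its full row and let a cheap positional wrapper cut it down to `expectedSchema`, rather than callers having to guess which schema the task produced.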
Does that make sense?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]