RussellSpitzer commented on issue #1735:
URL: https://github.com/apache/iceberg/issues/1735#issuecomment-724053798
A more complete view.
We call pruneColumns with ```{0: status, 1: snapshot, 3: sequence_number}``` as
our selected IDs on a second pass. (I'm still not sure why we hit this twice or
why we request these 3 columns; that seems like a bug too.)
These selectedIds are passed in with the fileSchema correctly, but there is a
slightly strange behavior within the pruning code:
https://github.com/apache/iceberg/blob/d1ba7b62abdad6b9fd8f3ec98f789ca53e9cf7b4/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L97-L98
This filter means that an "empty" schema is never removed. If empty schemas
are never removed, then any top-level schema element that contains an empty
sub-schema element is kept as well. This happens when a table is unpartitioned,
since the PartitionType struct subfield of the file schema is empty.
When the table is unpartitioned, the sub-schema of "partitionType" is empty, so
filteredFields.size() == record.getFields().size() and the "data_file" field is
not pruned out. When we go back up the stack and attempt to resolve the
top-level schema, we get a call to "record" with the last parameter "fields"
set to
```
null,
null,
{"type":"record","name":"r2","fields":[{"name":"partition","type":{"type":"record","name":"r102","fields":[]},"field-id":102}]},
```
This saves us in the "unpartitioned" pruning case, because when we get down to
the check
https://github.com/apache/iceberg/blob/d1ba7b62abdad6b9fd8f3ec98f789ca53e9cf7b4/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L80-L86
fields.get(2) is "not null" when it should be null, and that incorrect value is
what allows data_file to survive the pruning (although for the wrong reason).
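
To make the walkthrough above easier to follow, here is a minimal sketch of the
two checks as I understand them. It is a simplified model, not the Iceberg
code: `Node`, `pruneStruct`, `dataFileSurvives`, and the partition column ID
1000 are all hypothetical names made up for illustration, and a plain list of
nodes stands in for the Avro record schema. The IDs 0, 1, 2, 3, and 102 are the
ones mentioned in this comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model only: Node and pruneStruct are hypothetical stand-ins,
// not the Avro Schema type or Iceberg's PruneColumns visitor.
class PruneSketch {
  static final class Node {
    final int id;
    final String name;
    final List<Node> children; // null for a primitive, a list for a struct

    Node(int id, String name, List<Node> children) {
      this.id = id;
      this.name = name;
      this.children = children;
    }
  }

  // Returns the pruned field list, or null when the struct should be dropped.
  static List<Node> pruneStruct(List<Node> fields, Set<Integer> selectedIds) {
    List<Node> kept = new ArrayList<>();
    for (Node field : fields) {
      List<Node> prunedChild =
          field.children == null ? null : pruneStruct(field.children, selectedIds);
      if (selectedIds.contains(field.id)) {
        kept.add(field);
      } else if (prunedChild != null) {
        // Analogue of the "converted field is not null" check described above.
        kept.add(new Node(field.id, field.name, prunedChild));
      }
    }
    if (kept.size() == fields.size()) {
      // Analogue of the size comparison: trivially true for an empty struct (0 == 0).
      return fields;
    } else if (!kept.isEmpty()) {
      return kept;
    }
    return null;
  }

  // Builds a manifest-entry-like schema and reports whether data_file (id 2) survives.
  static boolean dataFileSurvives(boolean partitioned) {
    List<Node> partitionFields = partitioned
        ? Arrays.asList(new Node(1000, "part_col", null)) // 1000 is a made-up ID
        : Collections.<Node>emptyList();
    Node partition = new Node(102, "partition", partitionFields);
    List<Node> entry = Arrays.asList(
        new Node(0, "status", null),
        new Node(1, "snapshot", null),
        new Node(2, "data_file", Arrays.asList(partition)));
    Set<Integer> selectedIds = new HashSet<>(Arrays.asList(0, 1, 3));
    List<Node> pruned = pruneStruct(entry, selectedIds);
    if (pruned == null) {
      return false;
    }
    for (Node field : pruned) {
      if (field.id == 2) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println("unpartitioned: " + dataFileSurvives(false));
    System.out.println("partitioned: " + dataFileSurvives(true));
  }
}
```

In this model the unpartitioned case keeps data_file only because the empty
partition struct passes the size comparison, while the partitioned case drops
it, which matches the behavior described above.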