RussellSpitzer edited a comment on issue #1735:
URL: https://github.com/apache/iceberg/issues/1735#issuecomment-724053798
A more complete view.
We call pruneColumns with `{0: status, 1: snapshot, 3: sequence_number}` as our
selected IDs on a second pass. (I'm still not sure why we hit this twice or why
we request these 3 columns; that seems like a bug too.)
These selectedIds are passed in with the fileSchema correctly, but there is
some slightly strange behavior within the pruning code:
https://github.com/apache/iceberg/blob/d1ba7b62abdad6b9fd8f3ec98f789ca53e9cf7b4/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L97-L98
This filter means that an "empty" schema will never be removed. If empty
schemas are never removed, then any top-level schema element that contains an
empty sub-schema element will be kept. This happens when a table is
unpartitioned, since the PartitionType struct subfield of the file schema is
empty.
When the table is unpartitioned, the sub-schema of "partitionType" is empty,
which means filteredFields.size() == record.getFields().size() (both are 0), so
the "data_file" field is not pruned out. When we go back up the stack and
attempt to resolve the top-level schema, we get a call to "record" with the
last parameter, "fields", set to
```
null,
null,
{"type":"record","name":"r2","fields":[{"name":"partition","type":{"type":"record","name":"r102","fields":[]},"field-id":102}]},
```
This saves us in the "unpartitioned" pruning case, because when we get down to
the check
https://github.com/apache/iceberg/blob/d1ba7b62abdad6b9fd8f3ec98f789ca53e9cf7b4/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L80-L86
fields.get(2) has a field schema. Even though we didn't ask for "partition",
the line 97 check effectively says we can never prune an empty struct. Since we
can't prune the empty "partitionStruct", we end up having to keep data_file
even though it is not in our list of selected IDs.
**TL;DR** This is broken because we ask for the wrong field IDs, but it still
works for unpartitioned tables only because we also, incorrectly, can never
prune an empty struct.
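
To make the interaction above concrete, here is a minimal, self-contained
sketch of the pruning logic as I understand it. It is a simplified model, not
the actual PruneColumns code: `PruneSketch`, `Node`, `manifestEntry`, and
`names` are hypothetical names, and the field IDs 100 (`file_path`) and 1000
(`part_col`) are illustrative assumptions. Only the selected IDs `{0, 1, 3}`
and the "same size means keep" check come from the discussion above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PruneSketch {
  // Simplified stand-in for a schema node. This is NOT the Avro or Iceberg API.
  static final class Node {
    final int id;
    final String name;
    final List<Node> children; // null for a primitive; a (possibly empty) list for a struct

    Node(int id, String name, List<Node> children) {
      this.id = id;
      this.name = name;
      this.children = children;
    }

    boolean isStruct() {
      return children != null;
    }
  }

  // The IDs from the comment: status, snapshot, sequence_number.
  static final Set<Integer> SELECTED = new HashSet<>(Arrays.asList(0, 1, 3));

  // Returns the pruned struct, or null if the whole struct should be dropped.
  static Node pruneStruct(Node struct, Set<Integer> selectedIds) {
    List<Node> filtered = new ArrayList<>();
    boolean hasChange = false;
    for (Node field : struct.children) {
      // Nested structs are pruned first; primitives have no sub-result.
      Node pruned = field.isStruct() ? pruneStruct(field, selectedIds) : null;
      if (selectedIds.contains(field.id)) {
        filtered.add(field);
      } else if (pruned != null) {
        // Not selected, but something below it survived, so it is kept.
        hasChange = true;
        filtered.add(pruned);
      }
    }
    if (hasChange) {
      return new Node(struct.id, struct.name, filtered);
    } else if (filtered.size() == struct.children.size()) {
      // For an empty struct this is 0 == 0, so it is never pruned.
      return struct;
    } else if (!filtered.isEmpty()) {
      return new Node(struct.id, struct.name, filtered);
    }
    return null;
  }

  // Builds a cut-down manifest entry schema (assumed layout, for illustration).
  static Node manifestEntry(boolean partitioned) {
    List<Node> partitionFields = new ArrayList<>();
    if (partitioned) {
      partitionFields.add(new Node(1000, "part_col", null)); // hypothetical field
    }
    Node dataFile = new Node(2, "data_file", Arrays.asList(
        new Node(100, "file_path", null),
        new Node(102, "partition", partitionFields)));
    return new Node(-1, "manifest_entry", Arrays.asList(
        new Node(0, "status", null),
        new Node(1, "snapshot", null),
        dataFile,
        new Node(3, "sequence_number", null)));
  }

  // Lists the names of a struct's direct children.
  static List<String> names(Node struct) {
    List<String> out = new ArrayList<>();
    for (Node child : struct.children) {
      out.add(child.name);
    }
    return out;
  }

  public static void main(String[] args) {
    // Unpartitioned: the empty "partition" struct survives, so data_file does too.
    System.out.println(names(pruneStruct(manifestEntry(false), SELECTED)));
    // Partitioned: "partition" is pruned, and data_file goes with it.
    System.out.println(names(pruneStruct(manifestEntry(true), SELECTED)));
  }
}
```

In this model the unpartitioned case keeps `data_file` (containing only the
empty `partition` struct) even though ID 2 was never selected, while the
partitioned case drops it, which matches the behavior described above.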