RussellSpitzer edited a comment on issue #1735:
URL: https://github.com/apache/iceberg/issues/1735#issuecomment-724053798


   A more complete view.
   
   We call pruneColumns with ```{0 : status, 1 : snapshot, 3 : 
sequence_number}``` as our selected IDs on a second pass. (I'm still not sure 
why we hit this twice, or why we request these 3 columns; that seems like a bug 
too.)
   
   These selectedIds are passed in correctly along with the fileSchema, but 
there is some slightly strange behavior within the pruning code:
   
   
https://github.com/apache/iceberg/blob/d1ba7b62abdad6b9fd8f3ec98f789ca53e9cf7b4/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L97-L98
   
   This check means that an "empty" schema is never removed. If empty schemas 
are never removed, then any top-level schema element that contains an empty 
sub-schema element is kept as well. This happens when a table is unpartitioned, 
since the PartitionType struct subfield of the file schema is empty.
   
   When the table is unpartitioned, the sub-schema of "partitionType" is empty, 
so filteredFields.size() == record.getFields().size() holds trivially (0 == 0) 
and the "data_file" field that contains it is not pruned out. This means that 
when we go back up the stack and attempt to resolve the top-level schema, we get 
a call to "record" with the last parameter, "fields", set to
   
   ```
   null,
   null,
   
{"type":"record","name":"r2","fields":[{"name":"partition","type":{"type":"record","name":"r102","fields":[]},"field-id":102}]},
   ```
   
   This saves us in the "unpartitioned" pruning case, because when we get down 
to the check
   
   
   
https://github.com/apache/iceberg/blob/d1ba7b62abdad6b9fd8f3ec98f789ca53e9cf7b4/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L80-L86
   
   fields.get(2) has a non-null field schema. Even though we didn't ask for 
"partition", the Line 97 check means we can never prune an empty struct. Since 
we can't prune the empty "partitionStruct", we end up having to keep data_file 
even though it is not in our list of selected IDs.
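   
   To make the interaction of the two checks concrete, here is a minimal 
standalone sketch. It is not the real PruneColumns code and does not use Avro: 
`PruneSketch`, `Field`, `pruneStruct`, `manifestEntry` and `names` are 
hypothetical stand-ins, and the `file_path` (100) and `id_bucket` (1000) fields 
are only illustrative. It assumes the visitor works bottom-up and returns null 
for a pruned struct, which is how I read the linked lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PruneSketch {
  /** Hypothetical stand-in for an Avro field; children == null marks a primitive. */
  static final class Field {
    final int id;
    final String name;
    final List<Field> children;

    Field(int id, String name, List<Field> children) {
      this.id = id;
      this.name = name;
      this.children = children;
    }
  }

  /** Returns the fields kept for a struct, or null when the whole struct is pruned. */
  static List<Field> pruneStruct(List<Field> fields, Set<Integer> selectedIds) {
    List<Field> kept = new ArrayList<>();
    boolean hasChange = false;
    for (Field field : fields) {
      // Children are visited first, mimicking a bottom-up schema visitor.
      List<Field> pruned =
          field.children == null ? null : pruneStruct(field.children, selectedIds);
      if (selectedIds.contains(field.id)) {
        kept.add(field);
      } else if (pruned != null) {
        // Not selected, but a nested result survived, so the parent is kept too.
        hasChange = true;
        kept.add(new Field(field.id, field.name, pruned));
      }
    }
    if (hasChange) {
      return kept;
    } else if (kept.size() == fields.size()) {
      return fields; // also true for an empty struct, because 0 == 0
    } else if (!kept.isEmpty()) {
      return kept;
    }
    return null;
  }

  /** Builds a cut-down manifest entry; only ids 0, 1, 2, 3 and 102 come from the issue. */
  static List<Field> manifestEntry(List<Field> partitionFields) {
    List<Field> dataFile = Arrays.asList(
        new Field(100, "file_path", null),
        new Field(102, "partition", partitionFields));
    return Arrays.asList(
        new Field(0, "status", null),
        new Field(1, "snapshot_id", null),
        new Field(2, "data_file", dataFile),
        new Field(3, "sequence_number", null));
  }

  static List<String> names(List<Field> fields) {
    List<String> out = new ArrayList<>();
    for (Field field : fields) {
      out.add(field.name);
    }
    return out;
  }

  public static void main(String[] args) {
    Set<Integer> selectedIds = new HashSet<>(Arrays.asList(0, 1, 3));
    List<Field> noPartition = Collections.<Field>emptyList();
    List<Field> onePartition =
        Collections.singletonList(new Field(1000, "id_bucket", null));
    System.out.println(names(pruneStruct(manifestEntry(noPartition), selectedIds)));
    System.out.println(names(pruneStruct(manifestEntry(onePartition), selectedIds)));
  }
}
```

   In this model the unpartitioned entry keeps data_file (holding only the 
empty "partition" struct), while the partitioned entry loses data_file 
entirely, which matches the difference described above.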
   
   
   **TLDR** This is broken because we ask for the wrong field IDs, but it 
happens to work for unpartitioned tables only because we also, incorrectly, can 
never prune an empty struct.
   
   
   




