chrigehr commented on issue #14619:
URL: https://github.com/apache/iceberg/issues/14619#issuecomment-3784978202

   I have the same problem as described by the author of this issue. I tested this with both Spark 3.5.7 and Spark 4.0.1.
   
   Calling `rewrite_position_delete_files` fails on tables with list or map columns.
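   
   For reference, here is a minimal repro sketch in Spark (the `demo` catalog and `db.t` names are hypothetical; it assumes an Iceberg-enabled Spark session and merge-on-read deletes so that a position delete file is actually written):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class RewritePositionDeletesRepro {
     public static void main(String[] args) {
       // assumes spark.sql.catalog.demo is configured as an Iceberg catalog
       SparkSession spark = SparkSession.builder().getOrCreate();
   
       spark.sql("CREATE TABLE demo.db.t (id BIGINT, tags ARRAY<STRING>) "
           + "USING iceberg TBLPROPERTIES ('format-version'='2', "
           + "'write.delete.mode'='merge-on-read')");
       spark.sql("INSERT INTO demo.db.t VALUES (1, array('a')), (2, array('b'))");
       spark.sql("DELETE FROM demo.db.t WHERE id = 1"); // writes a position delete file
   
       // fails with the PartitionSpec validation error on Iceberg > 1.10.x
       spark.sql("CALL demo.system.rewrite_position_delete_files(table => 'db.t')");
     }
   }
   ```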
   
   As far as I understand (from debugging and analyzing the code), the problem is the following:
   
   - [PositionDeletesRowReader.open](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java#L93) gets `nonConstantFieldIds`. As I understand it, the constant fields are partition columns and similar fields that are known from metadata before the file is opened, so the non-constant IDs cover essentially all data columns.
   - An expression is then built with `ExpressionUtil.extractByIdInclusive`, which receives the field IDs of all these data columns.
   - This method creates a `PartitionSpec` from all the given IDs and tries to build a projection from it (see [code](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java#L160)).
   - The problem is that Iceberg does not allow fields nested inside lists or maps as partition source columns. In older Iceberg versions this use of `extractByIdInclusive` was harmless because the validations in `PartitionSpec` did not cover this case, but Iceberg > 1.10.x now has this [check](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L673); see the sketch after this list.
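   
   To illustrate the failing step in isolation, here is a rough sketch of what `extractByIdInclusive` effectively attempts once it receives the field ID of a list element (the class name is mine and the exact exception message may differ; this only mimics the identity-partition construction):
   
   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   
   public class NestedPartitionSourceCheck {
     public static void main(String[] args) {
       Schema schema = new Schema(
           Types.NestedField.required(1, "id", Types.LongType.get()),
           Types.NestedField.optional(2, "tags",
               Types.ListType.ofOptional(3, Types.StringType.get())));
   
       // extractByIdInclusive builds an identity partition per selected field
       // id; for the list element (field id 3) this now trips the validation
       PartitionSpec.builderFor(schema)
           .identity("tags.element") // throws on Iceberg > 1.10.x: the source
                                     // field is nested inside a list
           .build();
     }
   }
   ```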
   
   I'm not sure how to fix this. To a non-Iceberg developer it looks as if `ExpressionUtil.extractByIdInclusive` is being somewhat "misused" for reading position delete files. Possible ideas:
   - Remove the new validations in `PartitionSpec`; however, the validations themselves seem correct.
   - Filter out fields nested inside lists and maps in `PositionDeletesRowReader.nonConstantFieldIds` (https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java#L113); a sketch follows below.
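   
   For the second idea, here is a rough, untested sketch (the class and helper names are hypothetical, not existing Iceberg API): collect only the field IDs reachable without crossing a list or map, then retain only those IDs in `nonConstantFieldIds` before they reach `extractByIdInclusive`.
   
   ```java
   import java.util.HashSet;
   import java.util.Set;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Type;
   import org.apache.iceberg.types.Types;
   
   public class NonCollectionFieldIds {
     // Hypothetical helper: field ids that are valid identity partition
     // sources, i.e. not nested inside a list or map.
     static Set<Integer> idsNotUnderCollections(Schema schema) {
       Set<Integer> ids = new HashSet<>();
       collect(schema.asStruct(), ids);
       return ids;
     }
   
     private static void collect(Types.StructType struct, Set<Integer> ids) {
       for (Types.NestedField field : struct.fields()) {
         ids.add(field.fieldId());
         Type type = field.type();
         if (type.isStructType()) {
           // fields nested in plain structs remain valid partition sources
           collect(type.asStructType(), ids);
         }
         // deliberately do not descend into lists or maps
       }
     }
   }
   ```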
 
   
   
   How can we proceed with this bug? For me this is a big problem because I rely on a working `rewrite_position_delete_files` procedure; without it I see significant performance degradation as multiple position delete files accumulate per partition.

