chrigehr commented on issue #14619: URL: https://github.com/apache/iceberg/issues/14619#issuecomment-3784978202
I have the same problem as described by the author of this issue. I tested with Spark 3.5.7 as well as with Spark 4.0.1: calling `rewrite_position_delete_files` fails on tables that contain lists or maps (a minimal reproduction is sketched at the end of this comment).

As far as I understand from debugging and analyzing the code, the problem is the following:

- [PositionDeletesRowReader.open](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java#L93) computes `nonConstantFieldIds` (as I understand it, these are all field ids *except* partition columns and similar constant fields that are known from metadata before opening the file, so the set essentially contains all data columns).
- An expression is then built with `ExpressionUtil.extractByIdInclusive`, which receives the field ids of all these data columns.
- That method creates a `PartitionSpec` from the given ids and tries to build a projection based on it (see [code](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java#L160)).
- The problem is that Iceberg does not allow fields nested inside lists or maps as partition source columns. In older Iceberg versions this use of `extractByIdInclusive` was harmless because the validations in `PartitionSpec` did not check for this, but Iceberg > 1.10.x now performs this [check](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L673) (illustrated in the second sketch below).

I'm not sure how to fix this. To a non-Iceberg developer it looks as if `ExpressionUtil.extractByIdInclusive` is being somewhat "misused" for reading position delete files. Possible ideas:

- Remove the new validations in `PartitionSpec`, although the validations themselves actually seem correct.
- Filter out fields nested inside lists and maps in [`PositionDeletesRowReader.nonConstantFieldIds`](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java#L113) (see the third sketch below).

How can we proceed with this bug? For me it is a serious problem, because I rely on a working `rewrite_position_delete_files` procedure; without it I see a large performance degradation as multiple position delete files accumulate per partition.
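Here is a minimal reproduction sketch, assuming an existing `SparkSession` named `spark` with an Iceberg catalog named `local`; catalog, namespace, and table names are illustrative. Merge-on-read deletes are enabled so that a position delete file is actually written:

```java
// Minimal repro sketch; catalog/namespace/table names are illustrative.
spark.sql(
    "CREATE TABLE local.db.t (id BIGINT, tags ARRAY<STRING>) USING iceberg "
        + "TBLPROPERTIES ('format-version'='2', 'write.delete.mode'='merge-on-read')");
spark.sql("INSERT INTO local.db.t VALUES (1, array('a')), (2, array('b'))");
// merge-on-read DELETE produces a position delete file
spark.sql("DELETE FROM local.db.t WHERE id = 1");
// fails on Iceberg > 1.10.x while opening the position deletes for rewrite
spark.sql("CALL local.system.rewrite_position_delete_files(table => 'db.t')");
```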
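The second sketch tries to show the `PartitionSpec` validation in isolation, using a hypothetical schema. If I read the check correctly, on Iceberg > 1.10.x the builder call should throw a `ValidationException`, because the source field `tags.element` sits inside a list:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class NestedPartitionSourceCheck {
  public static void main(String[] args) {
    Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(
                2, "tags", Types.ListType.ofOptional(3, Types.StringType.get())));

    // On Iceberg > 1.10.x this is expected to fail the new validation,
    // because the identity source field "tags.element" is nested in a list.
    PartitionSpec.builderFor(schema).identity("tags.element").build();
  }
}
```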
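For the second idea, a rough sketch of how the ids could be filtered before they reach `ExpressionUtil.extractByIdInclusive`. The helper name `idsNotNestedInListsOrMaps` is hypothetical, not existing Iceberg API; it just recurses through structs and stops at lists and maps:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class NonConstantFieldIdFilter {

  // Hypothetical helper: collects only the field ids that are reachable
  // without crossing a list or map, i.e. valid partition source candidates.
  static Set<Integer> idsNotNestedInListsOrMaps(Schema schema) {
    Set<Integer> ids = new HashSet<>();
    collect(schema.asStruct(), ids);
    return ids;
  }

  private static void collect(Types.StructType struct, Set<Integer> ids) {
    for (Types.NestedField field : struct.fields()) {
      ids.add(field.fieldId());
      Type type = field.type();
      // recurse into structs only; ids under lists/maps are intentionally
      // skipped because they cannot serve as partition source fields
      if (type.isStructType()) {
        collect(type.asStructType(), ids);
      }
    }
  }
}
```

`PositionDeletesRowReader.nonConstantFieldIds` could then intersect its current result with such a set before handing the ids to `ExpressionUtil.extractByIdInclusive`.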
