chrigehr commented on issue #14619: URL: https://github.com/apache/iceberg/issues/14619#issuecomment-3784978202
I have the same problem as described by the author of this issue. I tested with Spark 3.5.7 as well as with Spark 4.0.1: calling `rewrite_position_delete_files` fails on tables that contain lists or maps (a minimal reproduction is sketched at the end of this comment).

As far as I understand from debugging and analyzing the code, the problem is the following:

- [PositionDeletesRowReader.open](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java#L93) computes `nonConstantFieldIds` (as I understand it, these are all field ids *except* partition columns and similar constant fields that are known from metadata before opening the file, so the set essentially contains all data columns).
- An expression is then built with `ExpressionUtil.extractByIdInclusive`, which receives the field ids of all these data columns.
- That method creates a `PartitionSpec` from the given ids and tries to build a projection based on it (see [code](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java#L160)).
- The problem is that Iceberg does not allow fields nested inside lists or maps as partition source columns. In older Iceberg versions this use of `extractByIdInclusive` was harmless because the validations in `PartitionSpec` did not check for this, but Iceberg > 1.10.x now performs this [check](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L673) (illustrated in the second sketch below).

I'm not sure how to fix this. To a non-Iceberg developer it looks as if `ExpressionUtil.extractByIdInclusive` is being somewhat "misused" for reading position delete files. Possible ideas:

- Remove the new validations in `PartitionSpec`, although the validations themselves actually seem correct.
- Filter out fields nested inside lists and maps in [`PositionDeletesRowReader.nonConstantFieldIds`](https://github.com/apache/iceberg/blob/15a72dc829844a4ca2f004139c9acc5f1a922578/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java#L113) (see the third sketch below).

How can we proceed with this bug? For me it is a serious problem, because I rely on a working `rewrite_position_delete_files` procedure; without it I see a large performance degradation as multiple position delete files accumulate per partition.
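Here is a minimal reproduction sketch, assuming an existing `SparkSession` named `spark` with an Iceberg catalog named `local`; catalog, namespace, and table names are illustrative. Merge-on-read deletes are enabled so that a position delete file is actually written:

```java
// Minimal repro sketch; catalog/namespace/table names are illustrative.
spark.sql(
    "CREATE TABLE local.db.t (id BIGINT, tags ARRAY<STRING>) USING iceberg "
        + "TBLPROPERTIES ('format-version'='2', 'write.delete.mode'='merge-on-read')");
spark.sql("INSERT INTO local.db.t VALUES (1, array('a')), (2, array('b'))");
// merge-on-read DELETE produces a position delete file
spark.sql("DELETE FROM local.db.t WHERE id = 1");
// fails on Iceberg > 1.10.x while opening the position deletes for rewrite
spark.sql("CALL local.system.rewrite_position_delete_files(table => 'db.t')");
```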
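The second sketch tries to show the `PartitionSpec` validation in isolation, using a hypothetical schema. If I read the check correctly, on Iceberg > 1.10.x the builder call should throw a `ValidationException`, because the source field `tags.element` sits inside a list:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class NestedPartitionSourceCheck {
  public static void main(String[] args) {
    Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(
                2, "tags", Types.ListType.ofOptional(3, Types.StringType.get())));

    // On Iceberg > 1.10.x this is expected to fail the new validation,
    // because the identity source field "tags.element" is nested in a list.
    PartitionSpec.builderFor(schema).identity("tags.element").build();
  }
}
```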
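For the second idea, a rough sketch of how the ids could be filtered before they reach `ExpressionUtil.extractByIdInclusive`. The helper name `idsNotNestedInListsOrMaps` is hypothetical, not existing Iceberg API; it just recurses through structs and stops at lists and maps:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class NonConstantFieldIdFilter {

  // Hypothetical helper: collects only the field ids that are reachable
  // without crossing a list or map, i.e. valid partition source candidates.
  static Set<Integer> idsNotNestedInListsOrMaps(Schema schema) {
    Set<Integer> ids = new HashSet<>();
    collect(schema.asStruct(), ids);
    return ids;
  }

  private static void collect(Types.StructType struct, Set<Integer> ids) {
    for (Types.NestedField field : struct.fields()) {
      ids.add(field.fieldId());
      Type type = field.type();
      // recurse into structs only; ids under lists/maps are intentionally
      // skipped because they cannot serve as partition source fields
      if (type.isStructType()) {
        collect(type.asStructType(), ids);
      }
    }
  }
}
```

`PositionDeletesRowReader.nonConstantFieldIds` could then intersect its current result with such a set before handing the ids to `ExpressionUtil.extractByIdInclusive`.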
