[GitHub] [iceberg] calvin-pietersen opened a new issue, #5871: Running out of disk space on row level deletes

GitBox Tue, 27 Sep 2022 14:03:15 -0700


calvin-pietersen opened a new issue, #5871:
URL: https://github.com/apache/iceberg/issues/5871

### Query engine

EMR 6.7.0
Spark 3.2.1
Iceberg 0.14.0

### Question

Hi,

We are performing row level deletion on an Iceberg table using Spark SQL.
Rows need to be deleted across multiple partitions/files.

`delete from data as d where exists (select id from deletes where d.id = id)`

When the delete runs, our EMR cluster runs out of disk space if a large
number of partitions/files are touched. On further inspection, Iceberg
performs a final repartition and sort just before replacing the data. My
question is: why does Iceberg need to repartition and sort the files? Should
the files not already be partitioned and sorted?
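
One way to confirm where the repartition and sort come from is to print the plan for the delete instead of running it. The sketch below only builds the SQL strings; executing them needs a live SparkSession with the Iceberg extensions configured, which is not shown. `explain_sql` and `set_delete_distribution_sql` are hypothetical helper names. `write.delete.distribution-mode` is an Iceberg table property, but whether changing it avoids the disk usage on Iceberg 0.14.0 is an assumption that would need to be verified against the plan.

```python
# Sketch only: builds SQL strings for inspection. Helper names are
# hypothetical; nothing here talks to Spark or Iceberg directly.

# The delete statement from the question, unchanged apart from casing.
DELETE_SQL = (
    "DELETE FROM data AS d "
    "WHERE EXISTS (SELECT id FROM deletes WHERE d.id = id)"
)


def explain_sql(statement: str, mode: str = "FORMATTED") -> str:
    """Wrap a statement in Spark SQL's EXPLAIN so the plan is shown, not run."""
    return f"EXPLAIN {mode} {statement}"


def set_delete_distribution_sql(table: str, mode: str) -> str:
    """Build an ALTER TABLE that sets Iceberg's delete distribution mode.

    Assumption: switching the mode changes the shuffle before the write;
    check the EXPLAIN output before relying on it.
    """
    allowed = {"none", "hash", "range"}
    if mode not in allowed:
        raise ValueError(f"mode must be one of {sorted(allowed)}, got {mode!r}")
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('write.delete.distribution-mode' = '{mode}')"
    )


if __name__ == "__main__":
    print(explain_sql(DELETE_SQL))
    print(set_delete_distribution_sql("data", "none"))
```

With a session available, each string could be passed to `spark.sql(...)` and the EXPLAIN result displayed to look for the exchange and sort nodes ahead of the write.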

The deletes table seems to be joined to the data table via a broadcast join.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]