calvin-pietersen opened a new issue, #5871: URL: https://github.com/apache/iceberg/issues/5871
### Query engine

EMR 6.7.0
Spark 3.2.1
Iceberg 0.14.0

### Question

Hi,

We are performing row-level deletion on an Iceberg table using Spark SQL. Rows need to be deleted across multiple partitions/files.

`delete from data as d where exists ( select id from deletes where d.id = id )`

When the deletion runs, our EMR cluster runs out of disk space if a large number of partitions/files are hit. On further inspection, Iceberg performs a final repartition and sort just before replacing the data. My question is: why does Iceberg need to repartition and sort the files? Should the files not already be partitioned/sorted?

<img width="450" alt="image" src="https://user-images.githubusercontent.com/16835507/192633685-f15fcc5b-b38c-4735-8257-91c9af3b7419.png">

The deletes table seems to be joined onto the data table via a broadcast join.

<img width="941" alt="image" src="https://user-images.githubusercontent.com/16835507/192634589-4c193747-1e63-4867-911d-e422ab0c2b72.png">

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
