calvin-pietersen opened a new issue, #5871:
URL: https://github.com/apache/iceberg/issues/5871

   ### Query engine
   
   EMR 6.7.0
   Spark 3.2.1
   Iceberg 0.14.0
   
   ### Question
   
   Hi,
   
   We are performing row-level deletes on an Iceberg table using Spark SQL. 
The rows to delete are spread across multiple partitions/files. 
   
   `delete from data as d where exists ( select id from deletes where d.id = id )`
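
   To make the intent of the statement concrete, here is a minimal plain-Python sketch of its semantics. It is not Spark or Iceberg code and says nothing about how Iceberg executes the delete; the partition names, ids, and the `apply_delete` helper are hypothetical and exist only for illustration.

```python
# Hypothetical sample rows as (partition, id). Values are illustrative only.
data = [
    ("p=2022-09-01", 1),
    ("p=2022-09-01", 2),
    ("p=2022-09-02", 3),
    ("p=2022-09-03", 4),
    ("p=2022-09-03", 5),
]
deletes = {2, 4}  # ids present in the `deletes` table


def apply_delete(rows, delete_ids):
    """Mimic: delete from data as d where exists (select id from deletes where d.id = id).

    Returns the surviving rows and the set of partitions that held a deleted row.
    """
    kept = [(part, row_id) for part, row_id in rows if row_id not in delete_ids]
    touched = {part for part, row_id in rows if row_id in delete_ids}
    return kept, touched


kept, touched = apply_delete(data, deletes)
print(sorted(touched))  # partitions containing at least one deleted row
print(kept)             # rows that remain after the delete
```

   In this toy data set the matching ids sit in two different partitions, which mirrors the situation described here: a single delete statement touches many partitions/files at once.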
   
   When the delete runs, our EMR cluster runs out of disk space if a large 
number of partitions/files are hit. On further inspection, Iceberg performs 
a final repartition and sort just before replacing the data. My question is: 
why does Iceberg need to repartition and sort the files? Should the files 
not already be partitioned/sorted?
   
   <img width="450" alt="image" 
src="https://user-images.githubusercontent.com/16835507/192633685-f15fcc5b-b38c-4735-8257-91c9af3b7419.png">
   
   The deletes table appears to be joined to the data table via a broadcast join.
   
   <img width="941" alt="image" 
src="https://user-images.githubusercontent.com/16835507/192634589-4c193747-1e63-4867-911d-e422ab0c2b72.png">
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

