kinolaev commented on PR #15712:
URL: https://github.com/apache/iceberg/pull/15712#issuecomment-4114594486
I can reproduce the issue in practice. I suspect the data files are loaded in parallel, and at least in the case of S3FileIO the connection pool is shared, so two connections are not enough. I just added two more inserts and deletes:
```sql
create table sparkdeletefilter(id bigint)
tblproperties('write.delete.mode'='merge-on-read');
insert into sparkdeletefilter select id from range(0, 2);
insert into sparkdeletefilter select id from range(2, 4);
insert into sparkdeletefilter select id from range(4, 6);
delete from sparkdeletefilter where id in (select id from range(0, 2, 2));
delete from sparkdeletefilter where id in (select id from range(2, 4, 2));
delete from sparkdeletefilter where id in (select id from range(4, 6, 2));
select count(id) from sparkdeletefilter;
```
and with `http-client.apache.max-connections=2` I get a ConnectionPoolTimeoutException about half the time. The more operations I add, the more often it fails, so it looks like a race condition. With the PR applied there is no exception.
I don't know exactly how many data and delete files it takes to tie up 50 connections, but I'd guess it's fewer than 1000.
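The failure mode above can be sketched with a toy model (this is not Iceberg code; the class and method names are made up for illustration). Each parallel reader holds one pooled "connection" open (the data-file stream) while trying to borrow a second one (the delete-file stream). Once every reader holds a connection, the pool is exhausted and every second borrow times out, which mimics the ConnectionPoolTimeoutException seen with `http-client.apache.max-connections=2`:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of connection-pool exhaustion, not actual S3FileIO code.
public class PoolExhaustionDemo {

    // Returns how many readers timed out waiting for a second connection.
    static int simulate(int poolSize, int readers) throws InterruptedException {
        Semaphore pool = new Semaphore(poolSize);          // shared connection pool
        CountDownLatch allHolding = new CountDownLatch(readers);
        AtomicInteger timeouts = new AtomicInteger();
        ExecutorService exec = Executors.newFixedThreadPool(readers);

        for (int i = 0; i < readers; i++) {
            exec.submit(() -> {
                try {
                    pool.acquire();                        // hold the data-file stream
                    allHolding.countDown();
                    allHolding.await();                    // all readers now hold one
                    // Try to open the delete-file stream from the same pool.
                    if (pool.tryAcquire(200, TimeUnit.MILLISECONDS)) {
                        pool.release();                    // got one; give it back
                    } else {
                        timeouts.incrementAndGet();        // pool exhausted: timeout
                    }
                    pool.release();                        // close the data-file stream
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        exec.shutdown();
        exec.awaitTermination(5, TimeUnit.SECONDS);
        return timeouts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // With a pool of 2 and 2 parallel readers, both second borrows time out.
        System.out.println("timeouts: " + simulate(2, 2));
    }
}
```

In this model, releasing the first connection before requesting the second (which is roughly what the PR achieves by not holding streams open concurrently) makes the timeouts disappear.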
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]