kinolaev opened a new pull request, #15713:
URL: https://github.com/apache/iceberg/pull/15713

   This PR prevents deadlocks while filtering manifest entries. The problem is 
that `ManifestFilterManager.filterManifest` in some cases reads each manifest 
twice: first in `manifestHasDeletedFiles` and then in 
`filterManifestWithDeletedFiles`. If `manifestHasDeletedFiles` returns in the 
middle of the entries iterable, the underlying connection stays open until the 
ManifestReader is closed. The `filterManifest` method is called for all 
manifests in parallel. When the number of simultaneous connections is limited 
(for example by `http-client.apache.max-connections`), this can lead to a 
deadlock: all connections are held by readers opened in 
`manifestHasDeletedFiles`, so `filterManifestWithDeletedFiles` cannot obtain 
one for the second read.
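
To illustrate the failure mode outside of Iceberg, here is a minimal, self-contained sketch. It is not the code changed by this PR: `PooledReader`, `hasDeletedLeaky`, `hasDeletedClosing` and `countLive` are hypothetical stand-ins, and a `Semaphore` with one permit plays the role of a connection pool limited to a single connection. The first pass returns in the middle of the entries; if it does not close its reader, the second pass times out waiting for a connection.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ManifestScanSketch {
  /** Hypothetical stand-in for a manifest reader: holds one pooled connection until closed. */
  static final class PooledReader implements AutoCloseable {
    private final Semaphore pool;
    private final Iterator<String> entries;
    private boolean closed = false;

    PooledReader(Semaphore pool, List<String> entries) {
      try {
        // Wait briefly for a free connection, like a pool with an acquisition timeout.
        if (!pool.tryAcquire(100, TimeUnit.MILLISECONDS)) {
          throw new IllegalStateException("timed out waiting for a connection");
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for a connection", e);
      }
      this.pool = pool;
      this.entries = entries.iterator();
    }

    Iterator<String> iterator() {
      return entries;
    }

    @Override
    public void close() {
      if (!closed) {
        closed = true;
        pool.release(); // give the connection back to the pool
      }
    }
  }

  /** First pass, leaky variant: an early return skips close(), so the connection stays held. */
  static boolean hasDeletedLeaky(Semaphore pool, List<String> entries, String deleted) {
    PooledReader reader = new PooledReader(pool, entries);
    Iterator<String> it = reader.iterator();
    while (it.hasNext()) {
      if (it.next().equals(deleted)) {
        return true; // returns mid-iteration without closing the reader
      }
    }
    reader.close();
    return false;
  }

  /** First pass, closing variant: try-with-resources releases the connection on every path. */
  static boolean hasDeletedClosing(Semaphore pool, List<String> entries, String deleted) {
    try (PooledReader reader = new PooledReader(pool, entries)) {
      Iterator<String> it = reader.iterator();
      while (it.hasNext()) {
        if (it.next().equals(deleted)) {
          return true;
        }
      }
      return false;
    }
  }

  /** Second pass: re-reads all entries and counts the ones that are not deleted. */
  static int countLive(Semaphore pool, List<String> entries, String deleted) {
    try (PooledReader reader = new PooledReader(pool, entries)) {
      int live = 0;
      Iterator<String> it = reader.iterator();
      while (it.hasNext()) {
        if (!it.next().equals(deleted)) {
          live++;
        }
      }
      return live;
    }
  }

  /** Runs both passes against a pool of size 1 and reports whether the second pass got a connection. */
  static boolean secondPassSucceeds(boolean closeFirstPass) {
    Semaphore pool = new Semaphore(1);
    List<String> entries = List.of("a", "b", "c");
    boolean found = closeFirstPass
        ? hasDeletedClosing(pool, entries, "a")
        : hasDeletedLeaky(pool, entries, "a");
    if (!found) {
      return false;
    }
    try {
      countLive(pool, entries, "a");
      return true;
    } catch (IllegalStateException e) {
      return false; // the pool was exhausted by the unclosed first-pass reader
    }
  }

  public static void main(String[] args) {
    System.out.println("leaky first pass, second pass ok: " + secondPassSucceeds(false));
    System.out.println("closing first pass, second pass ok: " + secondPassSucceeds(true));
  }
}
```

With a pool of size 1 the leaky variant makes the second pass fail, while the closing variant lets it proceed. In the real code path the same effect is multiplied across the parallel `filterManifest` calls.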
   
   The problem can be reproduced using spark-sql with S3FileIO and 
`http-client.apache.max-connections=1`:
   ```sql
   create table manifestfiltermanager(id bigint)
     partitioned by (truncate(1, id))
     tblproperties('write.delete.mode'='merge-on-read');
   -- create a data manifest with 100 entries
   insert into manifestfiltermanager select id from range(100);
   -- create a delete manifest with 50 entries
   delete from manifestfiltermanager where id in (select id from range(0, 100, 2));
   -- make delete files dangling (fails without https://github.com/apache/iceberg/pull/15712)
   call system.rewrite_data_files('manifestfiltermanager', options => map('rewrite-all', 'true'));
   -- ManifestFilterManager.manifestHasDeletedFiles reads first block of the delete manifest
   -- and returns true before the end of liveEntries() iterable without closing it.
   -- ManifestFilterManager.filterManifestWithDeletedFiles fails with ConnectionPoolTimeoutException
   call system.rewrite_data_files('manifestfiltermanager', options => map('rewrite-all', 'true'));
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

