lidavidm opened a new pull request #12099:
URL: https://github.com/apache/arrow/pull/12099

When the dataset writer is configured to delete existing data before writing, the target directory is on S3, the dataset is partitioned, and there are at least as many partitions as threads in the I/O thread pool, the writer hangs. The writer spawns a task on the I/O thread pool for each partition to delete that partition's existing data. However, S3FS implemented the relevant filesystem call by asynchronously listing the objects on the I/O thread pool, then deleting them, blocking until this was done. With every pool thread occupied by a writer task, none of the nested listing tasks could ever run, so this nested asynchrony deadlocked the program.

The fix is to perform the deletion fully asynchronously, so that no pool thread ever blocks. Simply using the default implementation of the async filesystem methods is sufficient: it merely spawns another task on the I/O thread pool, but it lets the writer avoid blocking. This PR additionally refactors the S3FS internals to implement the call truly asynchronously.
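To illustrate the hazard, here is a minimal Python sketch (not Arrow's actual code; `list_objects`, `delete_objects`, and the in-memory `fake_bucket` are hypothetical stand-ins for S3 operations). A task that blocks on another task submitted to the same fixed-size pool can deadlock; scheduling the deletion as a continuation of the listing keeps the workers from ever blocking:

```python
# Hedged sketch of the deadlock described above, and the continuation-based
# fix. All names here are illustrative, not Arrow APIs.
from concurrent.futures import ThreadPoolExecutor, Future

io_pool = ThreadPoolExecutor(max_workers=1)  # tiny pool makes the hazard obvious

# Hypothetical stand-in for an S3 bucket.
fake_bucket = {"part=0/f.parquet", "part=1/f.parquet"}

def list_objects(prefix):
    return [k for k in fake_bucket if k.startswith(prefix)]

def delete_objects(keys):
    for k in keys:
        fake_bucket.discard(k)

# BROKEN shape (commented out -- running it would hang): a pool task that
# submits a nested task to the same pool and blocks on its result. With all
# workers occupied, the inner task never runs.
#
# def delete_dir_blocking(prefix):
#     inner = io_pool.submit(list_objects, prefix)
#     delete_objects(inner.result())  # worker blocks on worker -> deadlock
#
# futures = [io_pool.submit(delete_dir_blocking, f"part={i}/") for i in range(2)]

# FIXED shape: fully asynchronous. The listing task schedules the deletion as
# a continuation, so no worker ever waits on the pool.
def delete_dir_async(prefix):
    done = Future()
    listing = io_pool.submit(list_objects, prefix)

    def then_delete(fut):
        delete_objects(fut.result())  # runs after listing completes; no blocking wait
        done.set_result(None)

    listing.add_done_callback(then_delete)
    return done

# Blocking happens only on the caller's thread, outside the pool, so even a
# one-thread pool makes progress with many partitions.
futures = [delete_dir_async(f"part={i}/") for i in range(2)]
for f in futures:
    f.result()
print(len(fake_bucket))  # -> 0: all partition data deleted
```

The design point mirrors the PR: the unit of work submitted to the I/O pool must never itself block on the pool; composing futures with continuations instead of blocking waits removes the deadlock regardless of pool size.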
When the dataset writer is configured to delete existing data before writing, the target directory is on S3, the dataset is partitioned, and there are at least as many partitions as threads in the I/O thread pool, then the writer would hang. The writer spawns a task on the I/O thread pool for each partition to delete existing data. However, S3FS implemented the relevant filesystem call by asynchronously listing the objects using the I/O thread pool, then deleting them, blocking until this is done. Hence, nested asynchrony would cause the program to hang. The fix is to do this deletion fully asynchronously, so that there is no blocking. It's sufficient to just use the default implementation of async filesystem methods; it just spawns another task on the I/O thread pool, but this lets the writer avoid blocking. However, this PR also refactors the S3FS internals to implement the call truly asynchronously. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org