lidavidm opened a new pull request #12099:
URL: https://github.com/apache/arrow/pull/12099


   When the dataset writer is configured to delete existing data before 
writing, the target directory is on S3, the dataset is partitioned, and there 
are at least as many partitions as threads in the I/O thread pool, the writer 
hangs. The writer spawns a task on the I/O thread pool for each partition to 
delete that partition's existing data. However, S3FS implemented the relevant 
filesystem call by asynchronously listing the objects on the same I/O thread 
pool, then deleting them, blocking until this completed. With every pool 
thread already occupied by a deletion task, the nested listing tasks could 
never run, so the program deadlocked.
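   The failure mode above can be reproduced in miniature. The following is an illustrative Python sketch, not Arrow's actual API: a one-thread pool stands in for a saturated I/O thread pool, and `delete_dir_contents`/`list_objects` are hypothetical stand-ins for the filesystem call and its nested listing task.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PoolTimeout

# A one-thread pool stands in for the I/O thread pool at full saturation.
pool = ThreadPoolExecutor(max_workers=1)

def list_objects():
    return ["part-0/a.parquet", "part-0/b.parquet"]

def delete_dir_contents():
    # Schedules the listing on the same pool, then blocks on it.
    # With every worker already busy running this very function,
    # the nested task can never start: a deadlock.
    inner = pool.submit(list_objects)
    return inner.result()

outer = pool.submit(delete_dir_contents)
hung = False
try:
    outer.result(timeout=1)
except PoolTimeout:
    hung = True
    print("hung: nested blocking exhausted the pool")
    # Unblock the stuck worker by cancelling the queued inner task
    # (cancel_futures requires Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
```

   With more workers than simultaneous blocking tasks the same code completes, which is why the bug only surfaces once there are at least as many partitions as I/O threads.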
   
   The fix is to perform the deletion fully asynchronously, so that no pool 
thread ever blocks. Using the default implementations of the async filesystem 
methods would suffice: they merely spawn another task on the I/O thread pool, 
but that alone lets the writer avoid blocking. In addition, this PR refactors 
the S3FS internals to implement the call truly asynchronously.
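   The shape of the non-blocking version can be sketched as follows. This is again a hypothetical Python illustration (Arrow's C++ code uses its own Future/callback machinery): each stage is chained as a completion callback instead of being awaited synchronously, so the sequence finishes even on a single-thread pool.

```python
from concurrent.futures import ThreadPoolExecutor, Future

pool = ThreadPoolExecutor(max_workers=1)

def list_objects():
    return ["part-0/a.parquet", "part-0/b.parquet"]

def delete_objects(keys):
    return len(keys)  # number of objects deleted

def delete_dir_contents_async():
    # Returns a future immediately: each stage is chained as a callback,
    # so no pool thread ever blocks waiting on another task.
    done = Future()
    listing = pool.submit(list_objects)

    def after_listing(fut):
        deleting = pool.submit(delete_objects, fut.result())
        deleting.add_done_callback(lambda f: done.set_result(f.result()))

    listing.add_done_callback(after_listing)
    return done

# Completes even though the pool has a single worker.
deleted = delete_dir_contents_async().result()
print("deleted", deleted, "objects")
```

   The caller still gets a single future for the whole operation, but the list-then-delete pipeline occupies a worker only while a stage is actually running.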


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

