westonpace commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1421342033
This sounds very similar to nested parallelism deadlocks we have had in the past:

* Outermost call: fork-join on a bunch of items (in this case it looks like we are doing fork-join on files)
* Inner task: fork-join on something else (e.g. in parquet it would be parquet column decoding)

If the inner task blocks on its "join" then it is wasting a thread pool thread. Once enough threads are wasted this way, every pool thread ends up blocked waiting on tasks that no free thread is left to run, and the pool deadlocks.
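To make the failure mode concrete, here is a minimal, self-contained C++ sketch (a toy fixed-size pool written for illustration, not Arrow's actual `ThreadPool`) where two outer tasks occupy both workers and then block on inner tasks that can never be scheduled:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A deliberately tiny fixed-size pool, just to demonstrate the hazard.
class FixedPool {
 public:
  explicit FixedPool(size_t n) {
    for (size_t i = 0; i < n; ++i) {
      // Detached workers run forever; fine for a short demo.
      std::thread([this] { WorkerLoop(); }).detach();
    }
  }
  std::future<void> Submit(std::function<void()> task) {
    auto done = std::make_shared<std::promise<void>>();
    auto fut = done->get_future();
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push([task = std::move(task), done] {
        task();
        done->set_value();
      });
    }
    cv_.notify_one();
    return fut;
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // runs on a pool thread
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
};

int main() {
  FixedPool pool(2);  // two worker threads
  std::vector<std::future<void>> outer;
  // Submit as many outer tasks as there are threads. Each outer task
  // forks an inner task and then *blocks* on the join.
  for (int i = 0; i < 2; ++i) {
    outer.push_back(pool.Submit([&pool, i] {
      auto inner = pool.Submit([i] { std::cout << "inner " << i << "\n"; });
      inner.wait();  // DEADLOCK: both workers are parked here, so no
                     // thread is left to run either inner task.
    }));
  }
  for (auto& f : outer) f.wait();  // never returns
}
```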
The solution we adopted was to migrate to an async model so that the "join" step becomes "return a future" instead of "block until done". This yields roughly the following rules (a sketch of the pattern follows the list):
* The user thread (the python top-level thread) should block on a top-level
future
* CPU threads should never block (outside of minor blocking on mutex guards
to sequence a tiny critical section)
* I/O threads should only block on OS calls. They should never block
waiting for other tasks.
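To illustrate the shape of that async model, here is a minimal runnable C++ sketch, with the caveat that it uses `std::thread`/`std::promise` as stand-ins for Arrow's own thread pool and `Future` types: the "join" is replaced by a completion counter that fulfills a promise, so no worker thread ever parks in a wait.

```cpp
#include <atomic>
#include <future>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

// Async-style fork-join: instead of blocking until the inner tasks
// finish, the outer call wires each inner task to decrement a counter;
// the last task to finish fulfills the promise.
std::future<void> ForkJoinAsync(int n_inner) {
  auto remaining = std::make_shared<std::atomic<int>>(n_inner);
  auto done = std::make_shared<std::promise<void>>();
  auto fut = done->get_future();
  for (int i = 0; i < n_inner; ++i) {
    // std::thread stands in for "submit to the CPU thread pool".
    std::thread([remaining, done, i] {
      std::cout << ("inner " + std::to_string(i) + "\n");
      if (remaining->fetch_sub(1) == 1) {
        done->set_value();  // last inner task completes the "join"
      }
    }).detach();
  }
  return fut;  // the join is now a future, not a blocking wait
}

int main() {
  // Only the user (top-level) thread blocks, on the top-level future.
  ForkJoinAsync(4).wait();
  std::cout << "all inner tasks finished\n";
}
```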
It seems like the copy_files/s3 combination is violating one of the above rules. There is an OptionalParallelFor in CopyFiles which blocks, but I think that is called from the user thread and so that is ok. @EpsilonPrime if you can reproduce this I would grab a thread dump from gdb (example below) and check what the pool threads are blocking on. The fix will probably be to move CopyFiles over to using async APIs (internally).
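For reference, getting that thread dump from the hung process amounts to (assuming `<pid>` is the stuck process):

```
$ gdb -p <pid>
(gdb) thread apply all bt
```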