westonpace commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1421342033

   This sounds very similar to nested parallelism deadlocks we have had in the 
past.
   
   Outermost call: fork-join on a bunch of items (in this case it looks like we are doing fork-join on files)
   Inner task: fork-join on something else (e.g. in Parquet it would be column decoding)
   
   If the inner task blocks on its "join" then it wastes a thread pool thread.  If enough threads are wasted this way, every pool thread ends up blocked waiting on other pool tasks, no thread is free to actually run those tasks, and the pool deadlocks.
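   
   To make the failure mode concrete, here is a minimal self-contained sketch (not Arrow code; `SimplePool` is a hypothetical stand-in for any bounded CPU pool) that hangs exactly this way:
   
   ```cpp
   // A minimal sketch (not Arrow code) of how nested fork-join exhausts a
   // bounded pool.  SimplePool is a hypothetical stand-in for the real
   // CPU thread pool; the program below hangs by design.
   #include <condition_variable>
   #include <functional>
   #include <future>
   #include <iostream>
   #include <mutex>
   #include <queue>
   #include <thread>
   #include <vector>
   
   class SimplePool {
    public:
     explicit SimplePool(size_t n) {
       for (size_t i = 0; i < n; ++i) {
         workers_.emplace_back([this] {
           for (;;) {
             std::function<void()> task;
             {
               std::unique_lock<std::mutex> lock(mu_);
               cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
               if (stop_ && tasks_.empty()) return;
               task = std::move(tasks_.front());
               tasks_.pop();
             }
             task();  // run outside the lock
           }
         });
       }
     }
     ~SimplePool() {
       {
         std::lock_guard<std::mutex> lock(mu_);
         stop_ = true;
       }
       cv_.notify_all();
       for (auto& w : workers_) w.join();
     }
     template <typename F>
     std::future<void> Submit(F f) {
       auto task = std::make_shared<std::packaged_task<void()>>(std::move(f));
       std::future<void> fut = task->get_future();
       {
         std::lock_guard<std::mutex> lock(mu_);
         tasks_.push([task] { (*task)(); });
       }
       cv_.notify_one();
       return fut;
     }
   
    private:
     std::vector<std::thread> workers_;
     std::queue<std::function<void()>> tasks_;
     std::mutex mu_;
     std::condition_variable cv_;
     bool stop_ = false;
   };
   
   int main() {
     SimplePool pool(2);  // pretend this is the global CPU pool
     std::vector<std::future<void>> outer;
     for (int i = 0; i < 2; ++i) {
       // Outer task: fork-join over "files".
       outer.push_back(pool.Submit([&pool] {
         // Inner fork: e.g. a per-column decode task.
         std::future<void> inner = pool.Submit([] { /* decode a column */ });
         // Inner join: parks a pool thread.  With both pool threads parked
         // here, the inner tasks are never picked up -> deadlock.
         inner.wait();
       }));
     }
     for (auto& f : outer) f.wait();
     std::cout << "done\n";  // never reached
   }
   ```
   
   With `SimplePool(3)` (more threads than blocked outer tasks) the same program completes, which is why these deadlocks tend to show up only under load.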
   
   The solution we adopted was to migrate to an async model so that the "join" step becomes "return a future" instead of "block until done" (a sketch of that restructuring follows the list).  This yields roughly the following rules:
   
    * The user thread (the Python top-level thread) should block only on a top-level future
    * CPU threads should never block (aside from briefly taking a mutex to guard a tiny critical section)
    * I/O threads should only block on OS calls.  They should never block waiting for other tasks.
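   
   As a rough illustration (a single-threaded sketch, not Arrow's actual Future machinery; `run_queue` and `CopyOneFile` are made up for the example), the "join" becomes a shared counter plus a continuation, so no thread ever parks waiting for child tasks:
   
   ```cpp
   // Single-threaded sketch of the async restructuring: the outer task
   // does not wait on its children; the last child to finish schedules
   // the continuation instead.
   #include <atomic>
   #include <deque>
   #include <functional>
   #include <iostream>
   #include <memory>
   
   std::deque<std::function<void()>> run_queue;  // stand-in for the CPU pool
   
   void CopyOneFile(int i) { std::cout << "copied file " << i << "\n"; }
   
   int main() {
     // "Outer" task: fork the per-file work and return immediately.
     run_queue.push_back([] {
       auto remaining = std::make_shared<std::atomic<int>>(3);
       for (int i = 0; i < 3; ++i) {
         run_queue.push_back([remaining, i] {
           CopyOneFile(i);
           // Non-blocking "join": the last child schedules the continuation.
           if (--*remaining == 0) {
             run_queue.push_back([] { std::cout << "all copies done\n"; });
           }
         });
       }
       // Note: no inner wait() here -- the thread is free for the children.
     });
   
     // Drain the queue on a single thread; everything completes because
     // nothing blocks.
     while (!run_queue.empty()) {
       std::function<void()> task = std::move(run_queue.front());
       run_queue.pop_front();
       task();
     }
   }
   ```
   
   In the real codebase the counter/continuation plumbing is what the future/callback machinery provides; the sketch just shows why a thread never needs to park.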
   
   It seems like the copy_files/s3 combination is violating one of the above rules.  There is an OptionalParallelFor in CopyFiles which blocks, but I think that is called from the user thread, so that is ok.  @EpsilonPrime if you can reproduce this, I would grab a thread dump with gdb and check what the pool threads are blocking on.  The fix will probably be to move CopyFiles over to using async APIs (internally).
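   
   For reference, attaching gdb to the hung process and dumping all stacks should show which tasks the pool threads are parked in:
   
   ```
   $ gdb -p <pid-of-stuck-process>
   (gdb) thread apply all bt
   ```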

