michalursa commented on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-901366200


   I looked into the failure in unit tests and here is what is happening.
   
   The InputReceived() method of the semi-join exec node uses the ThreadIndexer 
class to map the current thread to an integer between 0 and the thread pool's 
Capacity() - 1. There are Capacity() local states available to the threads, 
so it is important that ThreadIndexer never returns a number greater than 
that limit. 
   
   ThreadIndexer adds one more index for every new thread that uses it. The 
threads that use it in semi-join are the threads that call InputReceived() on 
its exec node. In the failing unit test, Capacity() is N but the number of 
threads calling InputReceived() is N+1. The N threads come from a thread pool 
passed to the generators used in the source array scan exec nodes for both 
the build and probe sides of the join. The one extra thread is the thread 
that executes the unit test; it does not belong to that thread pool.
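   
   The counting argument above can be illustrated with a small standalone 
sketch. This is not Arrow's ThreadIndexer: SketchThreadIndexer and 
MaxIndexSeen are hypothetical names, and the sketch assumes only what is 
described above, namely that each new thread is handed the next unused index.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for ThreadIndexer (not Arrow's code): hands out the
// next unused index the first time a thread calls it, and the same index on
// every later call from that thread.
class SketchThreadIndexer {
 public:
  std::size_t operator()() {
    std::lock_guard<std::mutex> lock(mutex_);
    // emplace() leaves an existing entry untouched; ids_.size() is evaluated
    // before the insertion, so a new thread gets the current entry count.
    auto it = ids_.emplace(std::this_thread::get_id(), ids_.size()).first;
    return it->second;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::thread::id, std::size_t> ids_;
};

// Simulates the failing setup: `num_pool_threads` worker threads (the "pool",
// with Capacity() == num_pool_threads) call the indexer, and then the calling
// thread, which is not part of the pool, calls it as well.
// Returns the largest index handed out.
std::size_t MaxIndexSeen(std::size_t num_pool_threads) {
  SketchThreadIndexer indexer;
  std::vector<std::size_t> seen(num_pool_threads);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_pool_threads; ++i) {
    workers.emplace_back([&indexer, &seen, i] { seen[i] = indexer(); });
  }
  for (auto& worker : workers) worker.join();

  std::size_t max_index = 0;
  for (std::size_t index : seen) {
    // The pool threads alone stay within [0, Capacity() - 1].
    assert(index < num_pool_threads);
    if (index > max_index) max_index = index;
  }

  // The extra, non-pool thread is the (N+1)-th distinct caller.
  std::size_t extra = indexer();
  if (extra > max_index) max_index = extra;
  return max_index;
}
```

   With a capacity of 4 the valid indices are 0 through 3, but MaxIndexSeen(4) 
yields 4, because the calling thread is the fifth distinct thread to use the 
indexer. An index like that would point past the end of an array of 
Capacity() local states.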
   
   How is this possible? In the failing case, the exec plan uses the default 
execution context, which has its thread pool pointer set to null. That means 
the intention is to run the plan single-threaded; a plan meant for parallel 
execution should not have it set to null. Scan nodes are bound to generators, 
and generators use the thread pools given to them (independently of the 
execution context of the exec plan) to execute a multi-threaded scan. Scan 
nodes are supposed to transfer tasks from the generator thread pool to the 
exec plan thread pool in the case of a parallel scan. But in this case a 
parallel scan is met with a serial plan: there is no thread pool to transfer 
to, and therefore the original thread is used (I assume). We get a mix of 
parallel (from the scan's generators) and single-threaded (from the plan) 
execution logic. This situation is not supposed to happen: a parallel scan 
should be used with a parallel execution plan.

