westonpace commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r804328939



##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
   }
 
   void InitLocalStateIfNeeded(size_t thread_index) {
+    DCHECK_LT(thread_index, local_states_.size());
     ThreadLocalState& local_state = local_states_[thread_index];

Review comment:
       > Even with Sys.setenv(OMP_THREAD_LIMIT = "1") this still occurs.
   
   That isn't too surprising.  `use_threads` triggers an entirely different 
code path in some places, so it is not equivalent to `OMP_THREAD_LIMIT = 
"1"`.
   
   > I also tried writing a C++ unit test that did a join after a dataset scan, 
but I couldn't reproduce the problem. That leads me to think there may be some 
issue with how the R bindings are configuring things, but it could also be I 
just didn't reproduce it quite well enough.
   
   How consistent is the R error?
   
   > Despite use_threads = FALSE, it seems like there are quite a few threads 
spawned by the engine. While I'm learning, I'm just not familiar enough to know 
which parts seem weird.
   
   `use_threads` generally does not control the I/O thread pool (which defaults 
to 8 threads and is not controlled by `OMP_THREAD_LIMIT`).  If someone were 
really keen on running everything on the calling thread, there is a way to do 
it, but it would be quite a bit of work.
   
   In addition, jemalloc (if compiled in) will spawn some background cleanup 
threads.
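
To make the purpose of the added `DCHECK_LT` concrete, here is a minimal, standard-library-only sketch of per-thread state indexed by `thread_index`. It is not the actual `HashJoinBasicImpl` code: `JoinState`, `CountOutOfRangeCalls`, and the boolean return value are hypothetical stand-ins, and the callers are simulated with a plain loop rather than real threads. It only illustrates the general hazard, assuming the state vector is sized for fewer callers than actually show up; it does not claim that this is the cause of the R failure.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-thread state referenced in the diff.
struct ThreadLocalState {
  bool is_initialized = false;
};

// Hypothetical holder of per-thread state, sized once for the number of
// threads the caller expects to participate.
class JoinState {
 public:
  explicit JoinState(std::size_t num_expected_threads)
      : local_states_(num_expected_threads) {}

  // Checks the index before touching the vector. The real code uses a
  // debug-only DCHECK_LT; this sketch reports the problem to the caller
  // instead of aborting, so the behavior is easy to observe.
  bool InitLocalStateIfNeeded(std::size_t thread_index) {
    if (thread_index >= local_states_.size()) {
      return false;  // would be an out-of-bounds access without the check
    }
    local_states_[thread_index].is_initialized = true;
    return true;
  }

 private:
  std::vector<ThreadLocalState> local_states_;
};

// Simulates callers with indices 0..num_callers-1 (no real threads are
// spawned) and returns how many calls carried an out-of-range index.
std::size_t CountOutOfRangeCalls(std::size_t num_expected_threads,
                                 std::size_t num_callers) {
  JoinState state(num_expected_threads);
  std::size_t rejected = 0;
  for (std::size_t index = 0; index < num_callers; ++index) {
    if (!state.InitLocalStateIfNeeded(index)) {
      ++rejected;
    }
  }
  return rejected;
}
```

With state sized for two threads and three callers, `CountOutOfRangeCalls(2, 3)` reports one rejected call. Without a check of this kind, that call would index past the end of the vector, which is undefined behavior.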




