MichaelChirico opened a new issue, #50239: URL: https://github.com/apache/arrow/issues/50239
### Describe the bug, including details regarding any error messages, version, and platform.

We recently enabled [ThreadSanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html) tests for R packages, and it surfaced some errors in the {arrow} suite, e.g.

```
WARNING: ThreadSanitizer: data race (pid=9303)
  Read of size 8 at 0x72900075c008 by thread T9:
    #0 INTEGER src/main/memory.c:4205:8
    #1 arrow::r::Converter_Int<arrow::UInt8Type>::Ingest_some_nulls(SEXPREC*, std::shared_ptr<arrow::Array> const&, long, long, unsigned long) const r/src/array_to_vector.cpp:191:19 (arrow.so+0xf0337)
    #2 operator() r/src/array_to_vector.cpp:88:24 (arrow.so+0xe00a9)
    #3 arrow::internal::FnOnce<arrow::Status ()>::FnImpl<arrow::r::Converter::ScheduleConvertTasks(arrow::r::RTasks&, std::shared_ptr<arrow::r::Converter>)::'lambda'()>::invoke() cpp/src/arrow/util/functional.h:153:42 (arrow.so+0xe00a9)
    #4 operator() cpp/src/arrow/util/functional.h:141:17
    #5 operator() cpp/src/arrow/util/task_group.cc:114:18
    #6 arrow::internal::FnOnce<void ()>::FnImpl<arrow::internal::(anonymous namespace)::ThreadedTaskGroup::AppendReal(arrow::internal::FnOnce<arrow::Status ()>)::'lambda'()>::invoke() cpp/src/arrow/util/functional.h:153:42
    #7 operator() cpp/src/arrow/util/functional.h:141:17
    #8 WorkerLoop cpp/src/arrow/util/thread_pool.cc:457:11
    #9 operator() cpp/src/arrow/util/thread_pool.cc:618:7
    #10 __invoke<(lambda at cpp/src/arrow/util/thread_pool.cc:616:23)>
    #11 __thread_execute
    #12 void* std::__thread_proxy
```

I don't really have the knowledge of internals required to completely debug the issue, but Gemini found a fix and summarized it. As usual with LLMs, it looks good at a high level, if a bit wordy. I can also propose Gemini's edit as a PR, if so requested.

---

### Description

During the conversion of Arrow Tables to R DataFrames (specifically when `use_threads = TRUE` is enabled), a data race can occur between the R main thread and background worker threads.
This race is detected by ThreadSanitizer (TSAN) when worker threads access R vector headers (via `INTEGER()`, `REAL()`, etc.) concurrently with the main thread registering the vectors into the output list.

### Root Cause Analysis

`to_data_frame` loops over columns and calls `Converter::LazyConvert` to schedule conversion tasks:

https://github.com/apache/arrow/blob/c75b82dd594d60b056ec7539ec2b7e8ba2aaff13/r/src/array_to_vector.cpp#L1405-L1408

1. **Immediate Execution:** Currently, `RTasks::Append` immediately submits parallel tasks to the CPU thread pool when they are scheduled. These background tasks start running immediately and begin writing to the pre-allocated R vector (`out`) for their respective columns.
2. **Concurrent Access:** To write data, the background worker threads call R API accessors like `INTEGER(data)` (or `REAL`, `LOGICAL`), which read the vector header to check the type via `TYPEOF(x)` (accessing `x->sxpinfo.type`).
3. **Main Thread Writes:** Meanwhile, the main thread continues the loop. As soon as `LazyConvert` returns the allocated (but not yet fully populated) vector, the main thread assigns it to the list element: `tbl[i] = out`. This assignment calls `SET_VECTOR_ELT`, which modifies the vector's header metadata (e.g., updating reference counts or setting GC write barriers/generation flags).
4. **The Race:** The worker threads are reading the vector's `sxpinfo` header concurrently with the main thread writing to the same `sxpinfo` header (which is a bitfield sharing the same memory word), leading to a data race.

### Component(s)

R

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
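
For illustration only, the ordering problem described under Root Cause Analysis can be modeled without R at all. The sketch below is an assumption-laden stand-in, not Arrow's code and not necessarily the fix Gemini proposed: `FakeVector`, `fake_INTEGER`, `fake_SET_VECTOR_ELT`, and `run_deferred` are hypothetical names, and the header layout is not R's real `SEXPREC`. It shows one way to avoid the pattern, namely collecting the conversion tasks first and starting the workers only after the main thread has finished every header write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical type tag for the stand-in vector (not taken from R headers).
constexpr int kFakeIntType = 13;

// Hypothetical stand-in for an R vector: type and refcount bits share one
// header word, so a write to one field touches the memory read for the other.
struct FakeVector {
  struct Header {
    std::uint64_t type : 5;   // read by workers, like TYPEOF(x)
    std::uint64_t refs : 16;  // written by the main thread on list assignment
  } header{};
  std::vector<int> data;
};

// Worker-side accessor: reads the header to check the type, like INTEGER().
int* fake_INTEGER(FakeVector& v) {
  assert(v.header.type == kFakeIntType);
  return v.data.data();
}

// Main-thread side: storing into the list updates header metadata.
void fake_SET_VECTOR_ELT(std::vector<FakeVector*>& list, std::size_t i,
                         FakeVector& v) {
  v.header.refs += 1;
  list[i] = &v;
}

// Deferred scheduling: tasks are only recorded inside the loop and started
// after all main-thread header writes are complete.
long run_deferred(std::size_t n_cols, std::size_t n_rows) {
  std::vector<FakeVector> cols(n_cols);
  std::vector<FakeVector*> tbl(n_cols, nullptr);
  std::vector<std::function<void()>> pending;

  for (std::size_t i = 0; i < n_cols; ++i) {
    FakeVector* col = &cols[i];
    col->header.type = kFakeIntType;
    col->data.assign(n_rows, 0);
    // Record the task; nothing runs on another thread yet.
    pending.push_back([col, i, n_rows] {
      int* out = fake_INTEGER(*col);
      for (std::size_t r = 0; r < n_rows; ++r) out[r] = static_cast<int>(i);
    });
    // Header write happens while no worker exists.
    fake_SET_VECTOR_ELT(tbl, i, *col);
  }

  // Thread creation orders the header writes above before the worker reads.
  std::vector<std::thread> workers;
  for (auto& task : pending) workers.emplace_back(task);
  for (auto& w : workers) w.join();

  long sum = 0;
  for (const FakeVector* col : tbl)
    for (int v : col->data) sum += v;
  return sum;
}
```

Moving the `workers.emplace_back(task)` call into the first loop would reproduce the shape of the reported race: a worker could read `header.type` while the main thread is still updating `header.refs` in the same word.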
