HyukjinKwon opened a new pull request, #48905:
URL: https://github.com/apache/arrow/pull/48905

   ### Rationale for this change
   
   The `DictionaryArray::dictionary()` method has a data race during lazy 
initialization. When multiple threads concurrently call `dictionary()`, the 
check then initialize pattern allows multiple threads to pass the check 
simultaneously, creating multiple `Array` objects instead of one shared cached 
instance.
   
   There were a bit of discussion in 
https://github.com/apache/arrow/pull/36418#discussion_r1248319000 but I just 
propose the standard approach with the keep it simple, stupid sprit.
   
   ### What changes are included in this PR?
   
   This PR makes `DictionaryArray::dictionary` thread-safe by `std::once_flag` 
and `std::call_once`.
   
   ### Are these changes tested?
   
   Manually tested with the example below:
   
   ```c++
   #include <arrow/api.h>
   #include <thread>
   #include <vector>
   
   int main() {
     arrow::StringBuilder dict_builder;
     dict_builder.Append("foo").Append("bar").Append("baz");
     auto dict = dict_builder.Finish().ValueOrDie();
     arrow::Int32Builder indices_builder;
     indices_builder.AppendValues({0, 1, 2, 0, 1, 2});
     auto indices = indices_builder.Finish().ValueOrDie();
   
     auto dict_array_result = arrow::DictionaryArray::FromArrays(indices, dict);
     auto dict_array = std::static_pointer_cast<arrow::DictionaryArray>(
         dict_array_result.ValueOrDie());
     
     // Spawn 20 threads that concurrently access dictionary
     std::vector<std::thread> threads;
     for (int i = 0; i < 20; ++i) {
       threads.emplace_back([&dict_array, i]() {
         const auto& dictionary = dict_array->dictionary();
         results[i] = dictionary.get();
       });
     }
     
     // Verify all threads got the same cached dictionary
     for (int i = 1; i < 20; ++i) {
       if (results[i] != results[0]) {
         return 1;
       }
     }
   }
   ```
   
   ### Are there any user-facing changes?
   
   As demonstrated above, it might return a different instance instead of the 
same instance.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to