HyukjinKwon opened a new pull request, #48905: URL: https://github.com/apache/arrow/pull/48905
### Rationale for this change The `DictionaryArray::dictionary()` method has a data race during lazy initialization. When multiple threads concurrently call `dictionary()`, the check then initialize pattern allows multiple threads to pass the check simultaneously, creating multiple `Array` objects instead of one shared cached instance. There were a bit of discussion in https://github.com/apache/arrow/pull/36418#discussion_r1248319000 but I just propose the standard approach with the keep it simple, stupid sprit. ### What changes are included in this PR? This PR makes `DictionaryArray::dictionary` thread-safe by `std::once_flag` and `std::call_once`. ### Are these changes tested? Manually tested with the example below: ```c++ #include <arrow/api.h> #include <thread> #include <vector> int main() { arrow::StringBuilder dict_builder; dict_builder.Append("foo").Append("bar").Append("baz"); auto dict = dict_builder.Finish().ValueOrDie(); arrow::Int32Builder indices_builder; indices_builder.AppendValues({0, 1, 2, 0, 1, 2}); auto indices = indices_builder.Finish().ValueOrDie(); auto dict_array_result = arrow::DictionaryArray::FromArrays(indices, dict); auto dict_array = std::static_pointer_cast<arrow::DictionaryArray>( dict_array_result.ValueOrDie()); // Spawn 20 threads that concurrently access dictionary std::vector<std::thread> threads; for (int i = 0; i < 20; ++i) { threads.emplace_back([&dict_array, i]() { const auto& dictionary = dict_array->dictionary(); results[i] = dictionary.get(); }); } // Verify all threads got the same cached dictionary for (int i = 1; i < 20; ++i) { if (results[i] != results[0]) { return 1; } } } ``` ### Are there any user-facing changes? As demonstrated above, it might return a different instance instead of the same instance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
