rtpsw commented on PR #34392: URL: https://github.com/apache/arrow/pull/34392#issuecomment-1509838035
The [recent commit](https://github.com/apache/arrow/pull/34392/commits/c2e9dbc39ecf31ffad284c0242109198fe2635e7) allowed the macOS CI job that previously failed to succeed this time (the [single-job failure](https://github.com/apache/arrow/actions/runs/4707863345/jobs/8349979477?pr=34392) this time is irrelevant). Of course, this does not prove the as-of-join code is now free of race conditions, yet the explanation below may help in reasoning about what's going on. The main idea in this recent fix is to ensure `MemoStore` has a valid current time when `GetCurrentTime()` is called, since in a future-as-of-join `GetCurrentTime()` returns `memo_.current_time_`. The initial current time of 0, which is set by the `MemoStore` constructor, is invalid; it was observed in the [debug session described above](https://github.com/apache/arrow/pull/34392#issuecomment-1501767228) and led to a null-value in the output. The fix includes the following: 1. Changing `MemoStore::current_time_` to be an atomic number. 2. Adding `MemoStore::UpdateTime`, which updates `MemoStore::current_time_` in a multi-threaded-safe way to the given time if the latter is greater. 3. Invoking `MemoStore::UpdateTime` with the time of the first row of a received batch. This happens-before (in the multi-threading sense of the term) the batch is pushed to the queue, and therefore ensures that `MemoStore::current_time_` is valid by the time `GetLatestTime()` is called when the batch is processed out of the queue. Note that I believe the as-of-join node's process-thread is not responsible for the race condition. It just processes in the same order of batches received, regardless of this being done in a separate thread. I believe the race condition is due to the non-deterministic order of arrival of batches to the as-of-join node, and that there's an order of arrival that drives the code (before the recent fix) to access an invalid `MemoStore::current_time_`. If so, this order could be found and then simulated in a test-case, though I think this task can be left to later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
