rtpsw commented on PR #34392:
URL: https://github.com/apache/arrow/pull/34392#issuecomment-1509838035

   The [recent 
commit](https://github.com/apache/arrow/pull/34392/commits/c2e9dbc39ecf31ffad284c0242109198fe2635e7)
 allowed the macOS CI job that previously failed to succeed this time (the 
[single-job 
failure](https://github.com/apache/arrow/actions/runs/4707863345/jobs/8349979477?pr=34392)
 this time is irrelevant). Of course, this does not prove the as-of-join code 
is now free of race conditions, yet the explanation below may help in reasoning 
about what's going on.
   
   The main idea in this recent fix is to ensure `MemoStore` has a valid 
current time when `GetCurrentTime()` is called, since in a future-as-of-join 
`GetCurrentTime()` returns `memo_.current_time_`. The initial current time of 
0, which is set by the `MemoStore` constructor, is invalid; it was observed in 
the [debug session described 
above](https://github.com/apache/arrow/pull/34392#issuecomment-1501767228) and 
led to a null-value in the output. The fix includes the following:
   1. Changing `MemoStore::current_time_` to be an atomic number.
   2.  Adding `MemoStore::UpdateTime`, which updates `MemoStore::current_time_` 
in a multi-threaded-safe way to the given time if the latter is greater.
   3. Invoking `MemoStore::UpdateTime` with the time of the first row of a 
received batch. This happens-before (in the multi-threading sense of the term) 
the batch is pushed to the queue, and therefore ensures that 
`MemoStore::current_time_` is valid by the time `GetLatestTime()` is called 
when the batch is processed out of the queue.
   
   Note that I believe the as-of-join node's process-thread is not responsible 
for the race condition. It just processes in the same order of batches 
received, regardless of this being done in a separate thread. I believe the 
race condition is due to the non-deterministic order of arrival of batches to 
the as-of-join node, and that there's an order of arrival that drives the code 
(before the recent fix) to access an invalid `MemoStore::current_time_`. If so, 
this order could be found and then simulated in a test-case, though I think 
this task can be left to later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to