rtpsw commented on PR #34392:
URL: https://github.com/apache/arrow/pull/34392#issuecomment-1501767228

   I fixed the race condition, at least locally, as I now see [this macOS CI 
failure](https://github.com/apache/arrow/actions/runs/4656962922/jobs/8241089942?pr=34392).
   
   In a debug session, I observed [that 
`GetCurrentTime()`](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L571)
 for one of the RHS tables was returning 0 when it was expected to be ahead, 
leading to [the tolerance not 
accepting](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L865)
 this 0 time [in 
`CompositeReferenceTable::Emplace`](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L834)
 and to [keeping a null 
batch](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L874),
 which [checked 
up](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L1019)
 as [a null 
value](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arro
 w/acero/asof_join_node.cc#L1024) in the [materialized 
column](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L1011)
 and ended up in the output. This scenario occurred only in a 
future-as-of-join, where `GetCurrentTime()` [returns 
`memo_.current_time_`](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L572).
 Following up on this, I noticed that `MemoStore` was [updating 
`current_time_`](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L308-L324)
 only upon 
[entry-removal](https://github.com/apache/arrow/blob/e3dd1cc7f918d3b4709ce712e7d08efafd8c9601/cpp/src/arrow/acero/asof_join_node.cc#L283).
 Apparently, the (not-so-controlled) timing of this entry-removal can cause the 
race condition. I fixed `MemoStore` to update the current time upon 
entry-store, and simplified its updating upon ent
 ry-removal.
   
   Following this fix, I did not locally observe the race condition in several 
runs of the tester, whereas before I observed it after one or two runs. There 
is still the macOS failure, however, I don't have access to a macOS to locally 
test on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to