rtpsw commented on PR #36499: URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625982410
> I am trying understand the issue pre-PR, but you seems to explaining the behavior after the code change here. Correct, sorry. > Can you explain your comment above w.r.t to my question below? Sure. See below. > I would expect when the processing thread invokes AdvanceAndMemoize, it would find the batch_ is NULLPTR and then compute the hash for the first batch. In the pre-PR code, this description is correct for the first batch. For the second batch, `AdvanceAndMemoize` [invokes `GetLatestKey`](https://github.com/apache/arrow/blob/fd949f8fb416025dc9d77c035c1cac5f45ed1b94/cpp/src/arrow/acero/asof_join_node.cc#L865) before it [invokes `Advance`](https://github.com/apache/arrow/blob/fd949f8fb416025dc9d77c035c1cac5f45ed1b94/cpp/src/arrow/acero/asof_join_node.cc#L869), which means `HashesFor` (invoked under `GetLatestKey`) finds `KeyHasher::batch_` referring to the first batch instead of to the second. This is what I meant by "finds hashes that are incorrect for the above batch". Now, taking this a step further, if the two batches have a different address, then things seem OK because the hashes would then be recomputed. I suspect the bug is triggered when the first and second batches happen to have the same address, which we know is a race condition, leading to the use of incorrect hashes. Granted, I haven't yet nailed down the exact triggering condition in the pre-PR code. It would take some effort - let me know if you'd like this investigated. For the time being, in the current version of the PR, I preferred to find the minimal change that works and I can explain. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
