rtpsw commented on PR #36499:
URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625982410

   > I am trying understand the issue pre-PR, but you seems to explaining the 
behavior after the code change here.
   
   Correct, sorry.
   
   > Can you explain your comment above w.r.t to my question below?
   
   Sure. See below.
   
   > I would expect when the processing thread invokes AdvanceAndMemoize, it 
would find the batch_ is NULLPTR and then compute the hash for the first batch.
   
   In the pre-PR code, this description is correct for the first batch. For the 
second batch, `AdvanceAndMemoize` [invokes 
`GetLatestKey`](https://github.com/apache/arrow/blob/fd949f8fb416025dc9d77c035c1cac5f45ed1b94/cpp/src/arrow/acero/asof_join_node.cc#L865)
 before it [invokes 
`Advance`](https://github.com/apache/arrow/blob/fd949f8fb416025dc9d77c035c1cac5f45ed1b94/cpp/src/arrow/acero/asof_join_node.cc#L869),
 which means `HashesFor` (invoked under `GetLatestKey`) finds 
`KeyHasher::batch_` referring to the first batch instead of to the second. This 
is what I meant by "finds hashes that are incorrect for the above batch".
   
   Now, taking this a step further, if the two batches have a different 
address, then things seem OK because the hashes would then be recomputed. I 
suspect the bug is triggered when the first and second batches happen to have 
the same address, which we know is a race condition, leading to the use of 
incorrect hashes. Granted, I haven't yet nailed down the exact triggering 
condition in the pre-PR code. It would take some effort - let me know if you'd 
like this investigated. For the time being, in the current version of the PR, I 
preferred to find the minimal change that works and I can explain.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to