zanmato1984 commented on PR #41614:
URL: https://github.com/apache/arrow/pull/41614#issuecomment-2107792835

   > It is hard to understand why this fixes the issue at all. The comments 
talk about an "atomic call" but I fail to see what is atomic here. This might 
even work by chance...
   
   Thanks for looking. And sorry about the bad wording - I should've come up 
with a better name than "atomic call". But the issue this PR is trying to fix 
is carefully studied and the fix is not by chance. Maybe I should've put this 
more explicitly than in the PR description, the issue can be 100% reproduced by 
injecting special delay as shown in this commit
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929,
 so the reason is clear enough.
   
   I shall explain it more:
   Suppose the left batch is pushed to asof join node at time `100`:
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1398
   Then at time `200` the `Process` thread starts and in turn calls 
`ProcessInner`:
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1034
   Within `ProcessInner`, it first advances the right batch (necessary for 
emitting results) (still time `200`, and it sees empty right batch):
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL967
   Then at time `300` the right batch is pushed to asof join node:
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1398
   At last, time `400`, `ProcessInner` checks if the right batch is up-to-date 
with the left batch (if yes, emits result):
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL974
   which it should not be the case because the right batch is not advanced at 
time `200` (because it is not pushed yet). But in `IsUpToDateWithLhsRow`, it 
checks for emptiness again:
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL942
   and
   
https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL552
   And this time, it's not empty (as the right batch is already pushed) and 
`ProcessInner` thinks "Great, you're up-to-date and let's emit some results".
   
   As you can see this is a logical race about the right batch's emptiness 
between `Push` and `ProcessInner`. My original attempt to explain the fix is 
(improperly) describing it as an "atomic call". But as Weston suggested, maybe 
"single call" is better. And hope this explanation could give you more 
confidence on this fix :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to