zanmato1984 commented on PR #41614: URL: https://github.com/apache/arrow/pull/41614#issuecomment-2107792835
> It is hard to understand why this fixes the issue at all. The comments talk about an "atomic call" but I fail to see what is atomic here. This might even work by chance... Thanks for looking. And sorry about the bad wording - I should've come up with a better name than "atomic call". But the issue this PR is trying to fix is carefully studied and the fix is not by chance. Maybe I should've put this more explicitly than in the PR description, the issue can be 100% reproduced by injecting special delay as shown in this commit https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929, so the reason is clear enough. I shall explain it more: Suppose the left batch is pushed to asof join node at time `100`: https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1398 Then at time `200` the `Process` thread starts and in turn calls `ProcessInner`: https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1034 Within `ProcessInner`, it first advances the right batch (necessary for emitting results) (still time `200`, and it sees empty right batch): https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL967 Then at time `300` the right batch is pushed to asof join node: https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1398 At last, time `400`, `ProcessInner` checks if the right batch is up-to-date with the left batch (if yes, emits result): https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL974 which it should not be the case because the right batch is not advanced at time `200` (because it is not pushed yet). But in `IsUpToDateWithLhsRow`, it checks for emptiness again: https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL942 and https://github.com/zanmato1984/arrow/commit/ea3b24c5f7308fe42f60dad41f51dbcbc1a54929#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL552 And this time, it's not empty (as the right batch is already pushed) and `ProcessInner` thinks "Great, you're up-to-date and let's emit some results". As you can see this is a logical race about the right batch's emptiness between `Push` and `ProcessInner`. My original attempt to explain the fix is (improperly) describing it as an "atomic call". But as Weston suggested, maybe "single call" is better. And hope this explanation could give you more confidence on this fix :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
