rtpsw commented on code in PR #34392:
URL: https://github.com/apache/arrow/pull/34392#discussion_r1215407511


##########
cpp/src/arrow/acero/asof_join_node.cc:
##########
@@ -1059,7 +1232,7 @@ class AsofJoinNode : public ExecNode {
         // If RHS is finished, then we know it's up to date
         if (rhs.CurrentEmpty())
           return false;  // RHS isn't finished, but is empty --> not up to date
-        if (lhs_ts >= rhs.GetCurrentTime())

Review Comment:
   I looked into this to some degree. When '>=' is used instead, as-of-join 
hangs for the `Basic7Forward` test-case, because both right tables have distant 
times (right at the beginning) and none of them advances.
   
   I tend to think that '>=' here won't work in general. Consider the case 
where the left table is at time `t` and a given right table has multiple rows 
with time `t`. In a past-join, we'd want to advance the right table past `t` 
and overwrite the stored entry up to the last row with time `t`, which would be 
ready for joining. In a future-join, we'd want to advance the same way but to 
store all rows except the first as future entries. Now, consider the case where 
the given right table has a row with a distant time `t+tolerance+1`. In a 
future-join, we'd want to advance the right table to this row and store it but 
only make it current when it is no longer distant. So, in a clearer (if not 
fixing) solution, I suspect the tolerance horizon needs to be accounted for in 
the condition here as well as in the storing/joining.
   
   This needs further looking into in a separate issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to