rtpsw commented on PR #34392:
URL: https://github.com/apache/arrow/pull/34392#issuecomment-1531964894

   **First problem: hang on distant times**
   
   What was the problem? In a future as-of-join, when the right table's next 
timestamp was distant (i.e., beyond the future tolerance) compared to the left 
table's, the join hung.
    
   What was the cause of the problem? The as-of-join node mishandled the 
`MemoStore` maintenance, failed to advance any of the tables, and went into an 
infinite loop. The as-of-join node had not previously been tested with distant 
times.
   
   What was the fix and how? 
[This](https://github.com/apache/arrow/pull/34392/commits/000219736c0abb2e0c9525963f083c0ab04d16cc)
 and [this 
commit](https://github.com/apache/arrow/pull/34392/commits/0dc7fa2339ddfa73be30f4780a35f3ca057f8957)
 fixed the `MemoStore` maintenance and the condition for advancing, and added a 
test-case with distant times.
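
To illustrate the progress requirement, here is a minimal sketch of a future as-of-join over two sorted lists of timestamps. It is a simplified model, not the Arrow implementation: `future_asof_join`, the plain-list inputs, and the inclusive tolerance window are all assumptions made for illustration, and there is no `MemoStore` here. It shows that when the next right timestamp lies beyond the tolerance, the left row is emitted unmatched and the loop still moves on.

```python
from typing import List, Optional, Tuple


def future_asof_join(
    left: List[int], right: List[int], tolerance: int
) -> List[Tuple[int, Optional[int]]]:
    """Match each left time t with the first right time in [t, t + tolerance].

    Hypothetical helper: both inputs are assumed sorted in ascending order.
    """
    out: List[Tuple[int, Optional[int]]] = []
    j = 0  # index of the next candidate row on the right side
    for t in left:
        # Advance the right side past rows that are earlier than t.
        while j < len(right) and right[j] < t:
            j += 1
        if j < len(right) and right[j] - t <= tolerance:
            out.append((t, right[j]))
        else:
            # The next right time is distant (or absent): emit the left row
            # unmatched and continue, so the loop always makes progress.
            out.append((t, None))
    return out


# The right side jumps from 1 to 1000, far beyond the tolerance of 5.
result = future_asof_join(left=[1, 2, 3], right=[1, 1000], tolerance=5)
```

In this model every iteration consumes one left row, so termination does not depend on how far apart the timestamps are.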
   
   **Second problem: non-deterministic output**
   
   What was the problem? The newly added test-case occasionally produced 
different output, and only on specific platforms/CI-jobs.
   
   What was the cause of the problem? The as-of-join node used cached hashes, 
instead of computing new ones, for the key columns of a new batch. This 
happened because the new batch had the same pointer-address as the previous 
one, and the cache-invalidation condition relied on the pointer-address 
changing for a new batch. This rare same-pointer-address condition, which 
triggered the bug, occurred only on specific platforms/CI-jobs, likely under 
restricted memory resources.
   
   What was the fix and how? The 
[fix](https://github.com/apache/arrow/pull/34392/commits/34134f0b40ea241fa3c686df06c487d0539b9c99)
 added cache invalidation upon receiving a new batch, allowing the hashes to be 
recomputed.
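
The failure mode can be reproduced in a small model. The sketch below is not the Arrow code: `Batch`, `KeyHasher`, `hashes_for`, and `invalidate` are hypothetical names, the `address` field stands in for a pointer address, and CRC32 stands in for the real key hash. It shows a cache that is refreshed only when the address changes returning stale hashes for a new batch that reuses an address, and explicit invalidation on each new batch avoiding that.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Batch:
    address: int     # stand-in for the batch's pointer address
    keys: List[str]  # values of the key column


def hash_key(key: str) -> int:
    # Deterministic stand-in for the real key hash.
    return zlib.crc32(key.encode("utf-8"))


class KeyHasher:
    """Caches key hashes for the most recently seen batch (hypothetical)."""

    def __init__(self) -> None:
        self._cached_address: Optional[int] = None
        self._cached_hashes: List[int] = []

    def invalidate(self) -> None:
        # Forget the cached batch so the next lookup recomputes the hashes.
        self._cached_address = None
        self._cached_hashes = []

    def hashes_for(self, batch: Batch) -> List[int]:
        # Address-only check: stale if a new batch reuses an old address.
        if batch.address != self._cached_address:
            self._cached_hashes = [hash_key(k) for k in batch.keys]
            self._cached_address = batch.address
        return self._cached_hashes


first = Batch(address=0x1000, keys=["a", "b"])
second = Batch(address=0x1000, keys=["c", "d"])  # same address, new contents

# Without invalidation, the second batch gets the first batch's hashes.
buggy = KeyHasher()
buggy.hashes_for(first)
stale = buggy.hashes_for(second)

# Invalidating whenever a new batch arrives forces recomputation.
fixed = KeyHasher()
fixed.hashes_for(first)
fixed.invalidate()
fresh = fixed.hashes_for(second)
```

An address is only unique among objects that are alive at the same time, so it cannot serve as an identity check across successive batches on its own.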


