westonpace opened a new pull request, #35087:
URL: https://github.com/apache/arrow/pull/35087

   ### Rationale for this change
   
   This fixes the test in #34474, though there are likely still other bad 
scenarios with large joins.  I've fixed this one because the behavior 
(producing invalid data) is particularly bad.  Most of the time, if there is 
too much data, I'm guessing we probably just crash.  Still, I think a test 
suite of some kind stressing large joins would be good to have.  Perhaps it 
could be added if someone finds time to work on join spilling.
   
   ### What changes are included in this PR?
   
   If the join would require more than 4GiB of key data, it now returns an 
invalid status instead of producing invalid data.
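   As a rough illustration (this is a hypothetical sketch, not the actual 
Arrow code, and `CheckKeyDataLimit`, `Status`, and `kMaxKeyDataBytes` are 
made-up names), the fix amounts to tracking the accumulated key data size in 
a 64-bit counter and failing fast once it can no longer be addressed by the 
32-bit offsets used internally, instead of silently wrapping around:
   
   ```cpp
   #include <cstdint>
   #include <string>
   
   // Hypothetical 4GiB ceiling: key data beyond this cannot be addressed
   // by 32-bit offsets.
   constexpr uint64_t kMaxKeyDataBytes = uint64_t{1} << 32;
   
   // Simplified stand-in for arrow::Status.
   struct Status {
     bool ok;
     std::string message;
   };
   
   // Returns an error status if appending `batch_bytes` of key data would
   // push the running total past the 4GiB limit.
   Status CheckKeyDataLimit(uint64_t accumulated_bytes, uint64_t batch_bytes) {
     if (accumulated_bytes + batch_bytes > kMaxKeyDataBytes) {
       return {false, "Join key data would exceed the 4GiB limit"};
     }
     return {true, ""};
   }
   ```
   
   The important property is that the check happens before the offsets 
overflow, so the caller sees an explicit error rather than corrupted output.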
   
   ### Are these changes tested?
   
   No.  I created a unit test, but it requires over 16GiB of RAM (besides the 
4GiB of input data itself, by the time you reach 4GiB of key data, various 
other join state buffers have also grown) and it took nearly a minute to run.  
I think investigation and creation of a test suite for large joins is 
probably a standalone effort.
   
   ### Are there any user-facing changes?
   
   No.

