oliviermeslin commented on issue #37655:
URL: https://github.com/apache/arrow/issues/37655#issuecomment-1755453402

   @vkhodygo : thanks for your quick reply. I'm not sure we are talking about 
the same thing. In my opinion there are actually two separate problems:
   
   - arrow cannot join tables whose key data is larger than 4 GB, because of 
the internals of Acero (this problem is explained 
[here](https://github.com/apache/arrow/issues/34474#issuecomment-1506009490)).
 This problem is likely to be quite difficult to solve, and the solution you 
suggest (batch processing) is probably the best we can do in the short term. 
But this is _not_ the bug I think I found.
   - [PR 35087](https://github.com/apache/arrow/pull/35087) introduced a test 
to check whether key data is larger than 4 GB, but this test is erroneously 
applied to the total size of the tables to be joined rather than to the key 
data alone. This is the bug I found, and it looks like it could be easily 
fixed (see my PR).
   
   I argue that solving this second problem would be a significant improvement 
over the current situation (even if the first problem remains), because I 
suspect that there are many use cases where tables are larger than 4 GB but key 
data is not.
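
   To make the distinction between the two problems concrete, here is a minimal sketch in plain Python. It does not use Arrow or Acero at all: the function names, the `table` dictionary, and the column sizes are all hypothetical, and the 4 GB limit is modelled as 4 GiB purely for illustration. It only shows how a size check applied to the whole table can reject a join that a check applied to the key columns would accept.

```python
# Hypothetical illustration only: none of these names come from Arrow or Acero.
GIB = 1024 ** 3
KEY_DATA_LIMIT = 4 * GIB  # assumed limit on key data, modelled here as 4 GiB


def total_bytes(column_sizes):
    """Size of the whole table: every column counts."""
    return sum(column_sizes.values())


def key_bytes(column_sizes, key_columns):
    """Size of the join key columns only."""
    return sum(column_sizes[name] for name in key_columns)


def check_on_table_size(column_sizes, key_columns):
    """Check applied to the whole table (the behaviour described as the bug)."""
    return total_bytes(column_sizes) <= KEY_DATA_LIMIT


def check_on_key_size(column_sizes, key_columns):
    """Check applied to the key data only (the presumably intended behaviour)."""
    return key_bytes(column_sizes, key_columns) <= KEY_DATA_LIMIT


# A made-up 10 GiB table whose single key column is only 1 GiB.
table = {"id": 1 * GIB, "payload_a": 5 * GIB, "payload_b": 4 * GIB}

print(check_on_table_size(table, ["id"]))  # False: join rejected
print(check_on_key_size(table, ["id"]))    # True: join allowed
```

   In this made-up case the table exceeds the limit while its key data stays well under it, which is the kind of use case the second problem affects.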


