oliviermeslin commented on issue #37655:
URL: https://github.com/apache/arrow/issues/37655#issuecomment-1755453402

@vkhodygo: thanks for your quick reply. I'm not sure we are talking about the same thing. In my opinion there are actually two separate problems:

- Arrow cannot join tables where key data is larger than 4 GB, because of the internals of Acero (this problem is explained [here](https://github.com/apache/arrow/issues/34474#issuecomment-1506009490)). This problem is likely to be quite difficult to solve, and the solution you suggest (batch processing) is probably the best we can do in the short term. But this is _not_ the bug I think I found.
- [PR 35087](https://github.com/apache/arrow/pull/35087) introduced a test to check whether key data is larger than 4 GB, and this test is erroneously applied to the size of the tables to be joined. This is the bug I found, and it looks like it could be easily fixed (see my PR).

I argue that solving this second problem would be a significant improvement over the current situation (even if the first problem remains), because I suspect that there are many use cases where tables are larger than 4 GB but key data is not.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
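
To make the distinction in the comment above concrete, here is a minimal sketch in plain Python. It is not Acero's code: the function names, the per-column size dictionary, and the choice of 4 GiB as the threshold are all assumptions made for illustration. It only contrasts a size check applied to the whole table with one applied to the key columns alone.

```python
# Hypothetical illustration only: none of these names exist in Arrow or Acero.
GIB = 1024**3
KEY_DATA_LIMIT = 4 * GIB  # assumed value for the "4 GB" limit discussed above


def table_bytes(column_sizes):
    """Total size of the table: the sum of all column sizes, in bytes."""
    return sum(column_sizes.values())


def key_bytes(column_sizes, key_columns):
    """Size of the join key data only: the sum of the key column sizes."""
    return sum(column_sizes[name] for name in key_columns)


def check_on_table_size(column_sizes, key_columns):
    """Check applied to the whole table (the behaviour reported as a bug)."""
    return table_bytes(column_sizes) <= KEY_DATA_LIMIT


def check_on_key_size(column_sizes, key_columns):
    """Check applied to the key data only (the behaviour argued for)."""
    return key_bytes(column_sizes, key_columns) <= KEY_DATA_LIMIT


# A table with a small key column and a large payload column.
sizes = {"id": 1 * GIB, "payload": 6 * GIB}

print(check_on_table_size(sizes, ["id"]))  # False: 7 GiB table exceeds the limit
print(check_on_key_size(sizes, ["id"]))    # True: 1 GiB of key data is within it
```

In this made-up case the table is 7 GiB but the key data is only 1 GiB, so a check on table size rejects a join that a check on key size would accept. This is the kind of use case the comment suspects is common.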
