westonpace commented on issue #34474: URL: https://github.com/apache/arrow/issues/34474#issuecomment-1506009490
I managed to look into this today. The bad news is that this join isn't supported. There are 9 key columns. The date and int columns are 8 bytes each. The string columns are variable-width but at least 4 bytes, and they average out close enough to 8 bytes that we can just use 8. That gives 72,000,000 rows * 9 columns * 8 bytes ≈ 5.2 GB of key data.

We store key data in a structure that we index with `uint32_t`, which means we can hold at most 4 GiB of key data. The current behavior is that the index overflows and clobbers existing data in our keys array, which is what leads to the incorrect results you are seeing.

I'm working on a fix that will detect this condition and fail the join when it encounters more than 4 GiB of key data. My guess is that implementing hash join spilling (e.g. https://github.com/apache/arrow/pull/13669) would naturally raise this limit. Until then, the best we can do is fail cleanly.
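To illustrate the failure mode (this is a minimal standalone sketch, not Arrow's actual code, and the row/byte counts are the estimates from above): when a 32-bit offset into the key buffer is advanced past 4 GiB it silently wraps, so later keys land on top of earlier ones.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Assumed sizes from the estimate above: 9 key columns at ~8 bytes each.
  constexpr uint64_t kNumRows = 72'000'000;
  constexpr uint64_t kBytesPerRowKey = 9 * 8;

  uint32_t offset = 0;        // 32-bit index into the key storage
  uint64_t true_offset = 0;   // what a 64-bit index would hold
  for (uint64_t row = 0; row < kNumRows; ++row) {
    offset += kBytesPerRowKey;       // silently wraps once total exceeds 4 GiB
    true_offset += kBytesPerRowKey;
  }

  std::cout << "bytes actually needed: " << true_offset << "\n";  // ~5.2 GB
  std::cout << "wrapped uint32 offset: " << offset << "\n";       // much smaller
  // Once the offset wraps, new keys overwrite old keys, producing wrong join output
  // rather than an error -- hence the plan to detect the overflow and fail instead.
  return 0;
}
```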
