westonpace commented on issue #34474: URL: https://github.com/apache/arrow/issues/34474#issuecomment-1506009490
I managed to look into this today. The bad news is that this join isn't supported. There are 9 key columns. The date and int columns are 8 bytes each. The string columns are variable-width but at least 4 bytes, and they average out close enough to 8 bytes that we can just use 8. That gives 72,000,000 rows * 9 columns * 8 bytes ≈ 5.2 GB of key data.

We store key data in a structure that we index with `uint32_t`, which means we can hold at most 4 GiB of key data. The current behavior is that the index overflows and clobbers existing data in our keys array, which is what leads to the incorrect results you are seeing.

I'm working on a fix that will detect this condition and fail the join when it encounters more than 4 GiB of key data. My guess is that implementing hash join spilling (e.g. https://github.com/apache/arrow/pull/13669) would naturally raise this limit. Until then, the best we can do is fail cleanly.
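To illustrate the failure mode (this is a minimal standalone sketch, not Arrow's actual code, and the row/byte counts are the estimates from above): when a 32-bit offset into the key buffer is advanced past 4 GiB it silently wraps, so later keys land on top of earlier ones.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Assumed sizes from the estimate above: 9 key columns at ~8 bytes each.
  constexpr uint64_t kNumRows = 72'000'000;
  constexpr uint64_t kBytesPerRowKey = 9 * 8;

  uint32_t offset = 0;        // 32-bit index into the key storage
  uint64_t true_offset = 0;   // what a 64-bit index would hold
  for (uint64_t row = 0; row < kNumRows; ++row) {
    offset += kBytesPerRowKey;       // silently wraps once total exceeds 4 GiB
    true_offset += kBytesPerRowKey;
  }

  std::cout << "bytes actually needed: " << true_offset << "\n";  // ~5.2 GB
  std::cout << "wrapped uint32 offset: " << offset << "\n";       // much smaller
  // Once the offset wraps, new keys overwrite old keys, producing wrong join output
  // rather than an error -- hence the plan to detect the overflow and fail instead.
  return 0;
}
```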
