kolfild26 commented on issue #44513:
URL: https://github.com/apache/arrow/issues/44513#issuecomment-2539147888

   @zanmato1984 
   Sorry for the delay.
   Yes, switching to a `RIGHT JOIN` fixes the issue. I wish I'd done it from the 
very beginning.
   So the working case looks like this:
   `large_table.join(small_table, join_type='right outer')`. There is no segfault and 
the result is correct.
   But `small_table.join(large_table, join_type='left outer')` still segfaults 
on v18.1.0.
   
   ```
   small.shape
   (18201475, 9)
   large.shape
   (360449051, 4)
   
   small.schema
   ID_DEV_STYLECOLOR: int64
   ID_DEV_STYLECOLOR_SIZE: int64
   ID_COLLECTION: int64
   ID_PARTITION_DIV_TMA: int64
   ID_END_QUOTING_DAY: int64
   ID_DEPARTMENT: int64
   ID_BEGIN_QUOTING_DAY: int64
   INTAKE_DATE: timestamp[us]
   UPA_MIN: int64
   
   large.schema
   ID_DEV_STYLECOLOR_SIZE: int64
   ID_DEPARTMENT: int64
   ID_COLLECTION: int64
   PL_VALUE: int64
   ```
   
   `large.join(small, keys=['ID_DEV_STYLECOLOR_SIZE', 'ID_DEPARTMENT', 
'ID_COLLECTION'], join_type='right outer')` ✅ 
   
   `small.join(large[0:200000000], keys=['ID_DEV_STYLECOLOR_SIZE', 
'ID_DEPARTMENT', 'ID_COLLECTION'], join_type='left outer')` ✅ (no segfault at 
least)
   
   `small.join(large, keys=['ID_DEV_STYLECOLOR_SIZE', 'ID_DEPARTMENT', 
'ID_COLLECTION'], join_type='left outer')` ❌ 
   
   In the left join case, the segfault appears once the large table reaches 
roughly 250 million rows.
   
   


