kolfild26 commented on issue #44513: URL: https://github.com/apache/arrow/issues/44513#issuecomment-2539147888
@zanmato1984 Sorry for the delay.

Yes, switching to `RIGHT JOIN` fixes the issue. I wish I'd done it from the very beginning. So a working case here is `large_table.join(small_table, join_type='right outer')`: no segfault, and the result is correct. But `small_table.join(large_table, join_type='left outer')` still segfaults on v18.1.0.

```
small.shape
(18201475, 9)
large.shape
(360449051, 4)

small.schema
ID_DEV_STYLECOLOR: int64
ID_DEV_STYLECOLOR_SIZE: int64
ID_COLLECTION: int64
ID_PARTITION_DIV_TMA: int64
ID_END_QUOTING_DAY: int64
ID_DEPARTMENT: int64
ID_BEGIN_QUOTING_DAY: int64
INTAKE_DATE: timestamp[us]
UPA_MIN: int64

large.schema
ID_DEV_STYLECOLOR_SIZE: int64
ID_DEPARTMENT: int64
ID_COLLECTION: int64
PL_VALUE: int64
```

`large.join(small, keys=['ID_DEV_STYLECOLOR_SIZE', 'ID_DEPARTMENT', 'ID_COLLECTION'], join_type='right outer')` ✅

`small.join(large[0:200000000], keys=['ID_DEV_STYLECOLOR_SIZE', 'ID_DEPARTMENT', 'ID_COLLECTION'], join_type='left outer')` ✅ (no segfault at least)

`small.join(large, keys=['ID_DEV_STYLECOLOR_SIZE', 'ID_DEPARTMENT', 'ID_COLLECTION'], join_type='left outer')` ❌

In the left join case, the segfault appears once the large table reaches roughly 250 million rows.
