Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

via GitHub Thu, 29 May 2025 23:19:00 -0700


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921331804


   > This is an independent problem. Because the join concatenates columns from 
both sides, the result table may contain columns with the same name. If so, you 
won't be able to reference such a column later without ambiguity. You can 
specify output_suffix_for_left/right to append unique identifiers to the column 
names so that you can disambiguate them.
   
   I see. So if I understand this correctly, we should ideally give the two 
columns distinct names before using a filter expression, since 
output_suffix_for_left only applies to the output at the end of the workflow, 
right? (Sorry if this is a dumb question...) For example, something like this 
won't work:
   
   ```python3
   import pyarrow.compute as pc
   from pyarrow.acero import Declaration, HashJoinNodeOptions

   # left_source and right_source are the two input declarations (defined elsewhere).
   join_opts = HashJoinNodeOptions(
       "inner", left_keys="key", right_keys="key",
       output_suffix_for_left="_left", output_suffix_for_right="_right",
       # <-- will hit "key not found" in both schemas
       filter=pc.equal(pc.field("key_left"), 2))
   joined = Declaration(
       "hashjoin", options=join_opts, inputs=[left_source, right_source])
   result = joined.to_table()
   ```
   
   If we don't use a filter at all, having the same column name on both sides 
is fine, and output_suffix_for_left only helps disambiguate the output. 
@zanmato1984
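
To make the question concrete, here is a minimal pure-Python sketch of the behaviour I am assuming: the filter's field names are looked up in the two input schemas, before any output suffix is applied. `resolve_filter_field` is a hypothetical helper written only for illustration. It is not an Arrow API and not how Acero is implemented.

```python
# Simplified model of the ASSUMED behaviour (not Acero's actual code):
# a field referenced by the filter must resolve to exactly one input column.

def resolve_filter_field(name, left_columns, right_columns):
    """Return "left" or "right" for a filter field, or raise if unresolved."""
    in_left = name in left_columns
    in_right = name in right_columns
    if in_left and in_right:
        raise ValueError(f"{name} is ambiguous: present in both inputs")
    if not in_left and not in_right:
        raise ValueError(f"{name} not found in either input")
    return "left" if in_left else "right"


left_columns = ["key", "a"]
right_columns = ["key", "b"]

# Neither the suffixed name nor the shared name can be resolved.
for candidate in ("key_left", "key"):
    try:
        resolve_filter_field(candidate, left_columns, right_columns)
    except ValueError as exc:
        print(exc)

# Workaround: give the right-hand key a distinct name before joining.
right_renamed = ["right_key" if c == "key" else c for c in right_columns]
print(resolve_filter_field("key", left_columns, right_renamed))
print(resolve_filter_field("right_key", left_columns, right_renamed))
```

Under this assumption, renaming one side's key before the join is what makes the filter unambiguous; the suffix options would then matter only for the names in the final output.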


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
