[ https://issues.apache.org/jira/browse/ARROW-14293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427883#comment-17427883 ]
David Li commented on ARROW-14293:
----------------------------------
Dataset.join returning an iterator makes sense to me.
Backing up though, is there a higher-level plan for what sorts of functionality
we're trying to expose? Are we targeting a subset of Pandas, perhaps? Obviously
full Pandas compatibility is not feasible or necessarily desirable, but it
might be worth considering the API as a whole before building out the parts.
(Apologies if this is already considered somewhere and this ticket is merely
the result of that.)
I agree with Weston's point, since natural follow-up questions will come up.
For example: we can do a filter and then a join, but how do we filter after a
join? (Collect into a table, then treat it as a Dataset? That gets awkward and
verbose fast.)
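
To make the asymmetry concrete, here is a toy sketch in plain Python. It is not
the PyArrow API: ToyDataset, its filter/join/to_list methods, and the sample
rows are all hypothetical stand-ins, and the only assumption carried over from
this ticket is that join yields an iterator rather than a new dataset.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


class ToyDataset:
    """Hypothetical stand-in for a dataset; NOT the PyArrow API."""

    def __init__(self, rows: Iterable[Row]):
        self._rows: List[Row] = list(rows)

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyDataset":
        # Filtering returns another dataset, so further calls can be chained.
        return ToyDataset(r for r in self._rows if predicate(r))

    def join(self, other: "ToyDataset", key: str) -> Iterator[Row]:
        # Inner hash join that yields rows lazily, modelling the
        # "join returns an iterator" idea discussed in this ticket.
        index: Dict[object, List[Row]] = {}
        for r in other._rows:
            index.setdefault(r[key], []).append(r)
        for left in self._rows:
            for right in index.get(left[key], []):
                yield {**left, **right}

    def to_list(self) -> List[Row]:
        return list(self._rows)


orders = ToyDataset([
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
    {"id": 3, "amount": 40},
])
customers = ToyDataset([
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
])

# Filter, then join: chains naturally because filter returns a dataset.
filter_then_join = list(
    orders.filter(lambda r: r["amount"] > 20).join(customers, "id")
)

# Join, then filter: the iterator has no filter method, so the result
# must be collected and re-wrapped before it can be filtered.
joined = ToyDataset(orders.join(customers, "id"))
join_then_filter = joined.filter(lambda r: r["name"] == "b").to_list()
```

The second path needs an extra collect-and-wrap step that the first does not,
which is the verbosity concern above.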
> [Python] Basic Join functionality in PyArrow
> --------------------------------------------
>
> Key: ARROW-14293
> URL: https://issues.apache.org/jira/browse/ARROW-14293
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Alessandro Molina
> Priority: Major
> Fix For: 7.0.0
>
>
> We want to expose {{Table.join}} and {{Dataset.join}} in PyArrow, leveraging
> the join feature of the ExecPlan.
> {{Table.join}} can easily return a new {{Table}}. What {{Dataset.join}}
> should return is a more complex question, as it probably doesn't make much
> sense to return a new {{Dataset}} given that the result won't map to any
> files on disk.