[
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849880#comment-16849880
]
Swapnil edited comment on SPARK-27785 at 5/28/19 4:09 PM:
----------------------------------------------------------
I like this idea. I can start working on it if it seems to be useful.
[~hyukjin.kwon] and [~joshrosen] - Do you have any initial thoughts/suggestions
on this proposal?
was (Author: swapnilushinde):
I like this idea. I can start working on it if it seems to be promising.
[~hyukjin.kwon] - Do you have any initial thoughts/suggestions on this proposal?
> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -------------------------------------------------------------------------
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Josh Rosen
> Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables:
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining
> on a third inner join requires users to specify a complicated join condition
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK),
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become
> even more painful if you want to layer on a fourth join. Using {{.map()}} to
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
>
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API
> proposed above (it's just my initial idea), so I'm open to suggestions for
> alternative typed APIs for this.
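
For reference, the workaround described in the issue can be sketched roughly as
follows. This only illustrates today's chained joinWith plus .map() approach,
not the proposed API. The case classes User, Order and Product and the helper
threeWayJoin are hypothetical names made up for the example, and it assumes a
local Spark session is acceptable.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, used only for illustration.
case class User(id: Long, name: String)
case class Order(orderId: Long, userId: Long, productId: Long)
case class Product(productId: Long, title: String)

object ChainedJoinWith {

  // Typed inner join of three datasets using only the existing two-way joinWith.
  def threeWayJoin(
      users: Dataset[User],
      orders: Dataset[Order],
      products: Dataset[Product]): Dataset[(User, Order, Product)] = {
    val spark = users.sparkSession
    import spark.implicits._

    // First join yields Dataset[(User, Order)] with struct columns _1 and _2.
    val userOrders: Dataset[(User, Order)] =
      users.joinWith(orders, users("id") === orders("userId"))

    // Second join: the condition has to reach into the nested struct _2,
    // and the result type is the doubly-nested ((User, Order), Product).
    val nested: Dataset[((User, Order), Product)] =
      userOrders.joinWith(
        products,
        userOrders("_2.productId") === products("productId"))

    // Flattening requires a typed map, which the issue notes has a
    // performance cost.
    nested.map { case ((u, o), p) => (u, o, p) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("chained-joinWith")
      .getOrCreate()
    import spark.implicits._

    val users = Seq(User(1L, "ann"), User(2L, "bob")).toDS()
    val orders = Seq(Order(10L, 1L, 100L), Order(11L, 2L, 101L)).toDS()
    val products = Seq(Product(100L, "lamp"), Product(101L, "desk")).toDS()

    threeWayJoin(users, orders, products).collect().foreach(println)
    spark.stop()
  }
}
```

With an overload like the one proposed above, the intermediate nested type and
the final map would presumably no longer be needed on the caller's side.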