[ https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852089#comment-16852089 ]
Josh Rosen commented on SPARK-27785:
------------------------------------

I think this might require a little more design work before beginning implementation, especially along the following dimensions:
# What arity do we max out at? At work I have some use cases involving joins of 10+ tables.
# Do we put this into {{Dataset}}? Or factor it out into a separate {{Multijoin}} helper object? Is there precedent for this in other frameworks (Scalding, Flink, etc.) that we could mirror?
# Instead of writing out each case by hand, can we write some Scala code to generate the signatures / code for us (similar to how the {{ScalaUDF}} overloads were defined)?
# In addition to saving some typing / projection, does this new API let us solve SPARK-19468 for a limited subset of cases?

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -------------------------------------------------------------------------
>
>                 Key: SPARK-27785
>                 URL: https://issues.apache.org/jira/browse/SPARK-27785
>             Project: Spark
>          Issue Type: New Feature
>      Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables:
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}}, so chaining
> on a third inner join requires users to specify a complicated join condition
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK),
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become
> even more painful if you want to layer on a fourth join. Using {{.map()}} to
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some
> reasonable number (say, 6).
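For context, the status-quo pain the description refers to might look roughly like the sketch below. This is not from the ticket: the case classes, dataset names, and join keys are all hypothetical, and the exact column-resolution syntax for the nested condition may vary by Spark version.

{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types, purely for illustration.
case class A(id: Long)
case class B(aId: Long, id: Long)
case class C(bId: Long)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val as: Dataset[A] = Seq(A(1L)).toDS()
val bs: Dataset[B] = Seq(B(1L, 10L)).toDS()
val cs: Dataset[C] = Seq(C(10L)).toDS()

// First typed join yields a tuple-shaped Dataset[(A, B)].
val ab: Dataset[(A, B)] = as.joinWith(bs, as("id") === bs("aId"))

// Chaining a second joinWith forces the condition to reach into the
// nested tuple via _1/_2, and yields the doubly-nested schema.
val nested: Dataset[((A, B), C)] =
  ab.joinWith(cs, ab("_2.id") === cs("bId"))

// Flattening with .map() recovers Dataset[(A, B, C)], but at the
// performance cost the description mentions (opaque lambda, extra
// serialization, lost optimizer visibility).
val flat: Dataset[(A, B, C)] = nested.map { case ((a, b), c) => (a, b, c) }
{code}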
> For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
>
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API
> proposed above (it's just my initial idea), so I'm open to suggestions for
> alternative typed APIs for this.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
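On point 3 of the comment above (generating the overloads rather than hand-writing them), a minimal plain-Scala sketch of such a generator might look like the following. The object and method names are assumptions, not part of the proposal; it simply prints candidate signatures for arities 3 through 6.

{code:java}
// Hypothetical generator for the proposed joinWith overload signatures,
// in the spirit of templating the ScalaUDF overloads. Illustrative only.
object JoinWithSignatureGen {

  // Builds the signature for a join of `n` tables total: the receiver's T
  // plus type parameters T1 .. T(n-1), one Dataset parameter per table.
  def signature(n: Int): String = {
    val typeParams = (1 until n).map(i => s"T$i")
    val dsParams   = typeParams.zipWithIndex.map { case (t, i) =>
      s"    ds${i + 1}: Dataset[$t],"
    }
    val resultTuple = ("T" +: typeParams).mkString("(", ", ", ")")
    val lines =
      Seq(s"def joinWith[${typeParams.mkString(", ")}](") ++
        dsParams ++
        Seq("    condition: Column", s"): Dataset[$resultTuple]")
    lines.mkString("\n")
  }

  def main(args: Array[String]): Unit =
    (3 to 6).foreach { n => println(signature(n)); println() }
}
{code}

Deriving every arity from one template would keep the overloads consistent and make raising the maximum arity (per point 1) a one-line change.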