Enrico Minack created SPARK-30957:
-------------------------------------

             Summary: Null-safe variant of Dataset.join(Dataset[_], Seq[String])
                 Key: SPARK-30957
                 URL: https://issues.apache.org/jira/browse/SPARK-30957
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Enrico Minack

The {{Dataset.join(Dataset, Seq[String])}} method is more convenient than {{Dataset.join(Dataset, joinExprs: Column)}} because the join columns named in the {{Seq[String]}} are not duplicated in the result {{DataFrame}}. Those columns are compared with {{===}}.

When the join columns must instead be compared null-safely with {{<=>}}, the join condition becomes very verbose and requires extra {{drop}} operations:

{code:java}
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"))
  .drop(df2("a"))
  .drop(df2("b"))
  .show()
{code}

A dedicated null-safe join operation would be more elegant:

{code:java}
df1.joinNullSafe(df2, joinColumns)
{code}

Possible names:
- {{Dataset.joinNullSafe(Dataset[_], Seq[String])}}
- {{Dataset.joinWithNulls(Dataset[_], Seq[String])}}
- {{Dataset.join(Dataset[_], Seq[String], <=>)}}

*I am happy to provide a PR if this Dataset API extension is appreciated.*

This request has been sent to the Apache Spark user and [dev|http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-dataframe-null-safe-joins-given-a-list-of-columns-tt28842.html] mailing lists by Marcelo Valle.
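
Until such a method exists, the verbose pattern from this issue can be wrapped in a small user-side helper. The following is only a sketch under stated assumptions: the object name {{NullSafeJoinSketch}}, the helper {{joinNullSafe}}, and the sample data are hypothetical and not part of the Spark API; it covers inner joins only and assumes {{left}} and {{right}} are distinct DataFrames (not a self-join), so that {{left(c)}} and {{right(c)}} resolve unambiguously.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NullSafeJoinSketch {

  // Hypothetical helper, NOT part of the Spark API.
  // Inner-joins `left` and `right` on `joinColumns` using the null-safe
  // equality operator <=>, then drops the right-hand copies of the join
  // columns so that each join column appears only once in the result.
  def joinNullSafe(left: DataFrame, right: DataFrame, joinColumns: Seq[String]): DataFrame = {
    require(joinColumns.nonEmpty, "at least one join column is required")

    // left("a") <=> right("a") && left("b") <=> right("b") && ...
    val condition = joinColumns
      .map(c => left(c) <=> right(c))
      .reduce(_ && _)

    // Drop right(c) for every join column c, as in the verbose example.
    joinColumns.foldLeft(left.join(right, condition)) { (df, c) =>
      df.drop(right(c))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-safe-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: one row per side has a null join key.
    val df1 = Seq[(Option[Int], String)]((Some(1), "x"), (None, "y")).toDF("a", "l")
    val df2 = Seq[(Option[Int], String)]((Some(1), "p"), (None, "q")).toDF("a", "r")

    // Existing API: keys are compared with ===, so the null keys do not match.
    df1.join(df2, Seq("a")).show()

    // Sketch: keys are compared with <=>, so the null keys match as well.
    joinNullSafe(df1, df2, Seq("a")).show()

    spark.stop()
  }
}
```

With this sample data, the existing {{join(df2, Seq("a"))}} returns only the row with {{a = 1}}, while the sketch also returns the row whose key is null on both sides. Outer joins are deliberately left out: dropping the right-hand join columns would lose key values for rows that exist only on the right, which a built-in variant would have to handle.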