Enrico Minack created SPARK-30957:
-------------------------------------
Summary: Null-safe variant of Dataset.join(Dataset[_], Seq[String])
Key: SPARK-30957
URL: https://issues.apache.org/jira/browse/SPARK-30957
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: Enrico Minack
The {{Dataset.join(Dataset, Seq[String])}} method is more convenient than
{{Dataset.join(Dataset, joinExprs: Column)}} because it does not duplicate the
join columns named in {{Seq[String]}} in the result {{DataFrame}}. Those columns
are compared with {{===}}, so rows with {{null}} in a join column never match.
When the join columns need to be compared null-safely with {{<=>}}, the join
condition becomes very verbose and requires extra {{drop}} operations:
{code:java}
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"))
  .drop(df2("a")).drop(df2("b"))
  .show()
{code}
A dedicated null-safe join operation would be more elegant:
{code:java}
df1.joinNullSafe(df2, joinColumns)
{code}
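Until such a method exists, the same call shape can be approximated in user
code. The following is only a minimal sketch, not part of Spark's API: the
object {{NullSafeJoinSyntax}}, the class {{NullSafeJoinOps}} and the method
{{joinNullSafe}} are hypothetical names chosen for illustration. It assumes an
inner join between two Datasets with distinct lineage (not a self-join), and it
keeps the join columns of the left side:
{code:java}
import org.apache.spark.sql.{Column, DataFrame, Dataset}

// Hypothetical extension, not part of Spark: adds joinNullSafe to any Dataset.
object NullSafeJoinSyntax {
  implicit class NullSafeJoinOps(left: Dataset[_]) {
    // Inner join on the given columns using null-safe equality (<=>).
    def joinNullSafe(right: Dataset[_], joinColumns: Seq[String]): DataFrame = {
      require(joinColumns.nonEmpty, "at least one join column is required")
      // Build left(c1) <=> right(c1) && left(c2) <=> right(c2) && ...
      val condition: Column =
        joinColumns.map(c => left(c) <=> right(c)).reduce(_ && _)
      // Drop the right-hand copy of each join column so it appears only once.
      joinColumns.foldLeft(left.join(right, condition)) { (df, c) =>
        df.drop(right(c))
      }
    }
  }
}
{code}
With {{import NullSafeJoinSyntax._}} in scope, the verbose expression above
shortens to {{df1.joinNullSafe(df2, Seq("a", "b")).show()}}. Outer joins would
need more care than this sketch takes, because dropping the right-hand columns
loses key values for rows that have no match on the left.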
Possible names:
- {{Dataset.joinNullSafe(Dataset[_], Seq[String])}}
- {{Dataset.joinWithNulls(Dataset[_], Seq[String])}}
- {{Dataset.join(Dataset[_], Seq[String], <=>)}}
*I am happy to provide a PR if this Dataset API extension is appreciated.*
This request has also been sent to the Apache Spark user and
[dev|http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-dataframe-null-safe-joins-given-a-list-of-columns-tt28842.html]
mailing lists by Marcelo Valle.