Josh Rosen created SPARK-14761:
----------------------------------
Summary: PySpark DataFrame.join should reject invalid join methods
even when join columns are not specified
Key: SPARK-14761
URL: https://issues.apache.org/jira/browse/SPARK-14761
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Reporter: Josh Rosen
Priority: Minor
In PySpark, the following invalid DataFrame join will not result in an error:
{code}
df1.join(df2, how='not-a-valid-join-type')
{code}
The signature for `join` is
{code}
def join(self, other, on=None, how=None):
{code}
and its code ends up completely skipping handling of the `how` parameter when
`on` is `None`:
{code}
if on is not None and not isinstance(on, list):
    on = [on]
if on is None or len(on) == 0:
    jdf = self._jdf.join(other._jdf)
elif isinstance(on[0], basestring):
    if how is None:
        jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
    else:
        assert isinstance(how, basestring), "how should be basestring"
        jdf = self._jdf.join(other._jdf, self._jseq(on), how)
else:
{code}
Given that this behavior can mask user errors (as in the above example), I
think that we should refactor this to first process all arguments and then call
the three-argument {{_jdf.join}}. This would handle the above invalid example
by passing all arguments to the JVM DataFrame for analysis.
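As a rough illustration of that refactoring, here is a minimal sketch that runs without Spark. Everything in it is an assumption made for the example: {{normalize_join_args}}, {{join}}, and {{FakeJvmDataFrame}} are hypothetical names, the set of accepted join types is illustrative only, and {{str}} stands in for {{basestring}} so the snippet runs on Python 3. How "no join columns" should be represented in the real three-argument JVM call is left open here.

```python
# Hypothetical sketch, not PySpark code: normalize `on` and `how` first,
# then always make a single three-argument call so `how` is never dropped.

class FakeJvmDataFrame:
    """Stand-in for the JVM-side DataFrame; rejects unknown join types."""

    # Illustrative list only; the real JVM side decides what is valid.
    VALID_JOIN_TYPES = {"inner", "outer", "left_outer", "right_outer", "leftsemi"}

    def join(self, other, on, how):
        if how not in self.VALID_JOIN_TYPES:
            raise ValueError("Unsupported join type %r" % (how,))
        return ("joined", tuple(on), how)


def normalize_join_args(on=None, how=None):
    """Return (on_list, how) with `on` wrapped in a list and `how` defaulted."""
    if on is not None and not isinstance(on, list):
        on = [on]
    if how is None:
        how = "inner"
    if not isinstance(how, str):
        raise TypeError("how should be a string, got %r" % (how,))
    return (on or []), how


def join(jdf, other_jdf, on=None, how=None):
    """Process all arguments up front, then forward every one of them."""
    on, how = normalize_join_args(on, how)
    return jdf.join(other_jdf, on, how)
```

With this shape, {{join(FakeJvmDataFrame(), FakeJvmDataFrame(), how='not-a-valid-join-type')}} raises instead of silently succeeding, because {{how}} reaches the validating side even when {{on}} is {{None}}.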
I'm not planning to work on this myself, so this bugfix (+ regression test!) is
up for grabs in case someone else wants to do it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)