Josh Rosen created SPARK-14761:
----------------------------------

             Summary: PySpark DataFrame.join should reject invalid join methods 
even when join columns are not specified
                 Key: SPARK-14761
                 URL: https://issues.apache.org/jira/browse/SPARK-14761
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
            Reporter: Josh Rosen
            Priority: Minor


In PySpark, the following invalid DataFrame join does not raise an error:

{code}
df1.join(df2, how='not-a-valid-join-type')
{code}

The signature for `join` is

{code}
    def join(self, other, on=None, how=None):
{code}

and its code ends up completely skipping handling of the `how` parameter when 
`on` is `None`:

{code}
        if on is not None and not isinstance(on, list):
            on = [on]

        if on is None or len(on) == 0:
            jdf = self._jdf.join(other._jdf)
        elif isinstance(on[0], basestring):
            if how is None:
                jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
            else:
                assert isinstance(how, basestring), "how should be basestring"
                jdf = self._jdf.join(other._jdf, self._jseq(on), how)
        else:
{code}

Given that this behavior can mask user errors (as in the above example), I 
think that we should refactor this to first process all arguments and then call 
the three-argument {{_jdf.join}}. This would handle the above invalid example 
by passing all arguments to the JVM DataFrame for analysis.
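
A rough sketch of that refactoring, under stated assumptions: `normalize_join_args`, `StubJvmDataFrame`, and `_STUB_JOIN_TYPES` are hypothetical names invented for illustration, and the stub only imitates the JVM side by rejecting join types outside a made-up set. It covers string column names only, and it leaves open how an empty `on` should map onto the real JVM API.

```python
# Illustrative stand-in for the join types the JVM side accepts.
# This set is an assumption for the sketch, not Spark's actual list.
_STUB_JOIN_TYPES = {"inner", "outer", "left_outer", "right_outer", "leftsemi"}


class StubJvmDataFrame:
    """Hypothetical stand-in for the JVM DataFrame behind `_jdf`."""

    def join(self, other, on, how):
        # The real JVM side would analyze the whole join; this stub only
        # checks `how`, which is enough to show the error surfacing.
        if how not in _STUB_JOIN_TYPES:
            raise ValueError("Unsupported join type: %r" % (how,))
        return ("joined", tuple(on), how)


def normalize_join_args(on=None, how=None):
    """Process every argument up front, regardless of whether `on` is set."""
    if on is None:
        on = []
    elif not isinstance(on, list):
        on = [on]

    if how is None:
        how = "inner"
    elif not isinstance(how, str):
        raise TypeError("how should be a string")

    return on, how


def join(left, right, on=None, how=None):
    # Always take the three-argument path so `how` is never silently dropped.
    on, how = normalize_join_args(on, how)
    return left.join(right, on, how)
```

With this shape, `join(left, right, how='not-a-valid-join-type')` raises in the stub instead of being accepted, because `how` reaches the validating side even when `on` is omitted.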

I'm not planning to work on this myself, so this bugfix (+ regression test!) is 
up for grabs in case someone else wants to do it.


