[GitHub] spark pull request #18484: [SPARK-21224][PYTHON] Call cross join path in PyS...

HyukjinKwon Fri, 30 Jun 2017 01:21:40 -0700

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18484


    [SPARK-21224][PYTHON] Call cross join path in PySpark join with 'how' specified rather than throwing NPE

    ## What changes were proposed in this pull request?
    
    Currently, `join` in PySpark throws an NPE when the join columns are missing but the join type is specified, as below:
    
    
    ```python
    spark.conf.set("spark.sql.crossJoin.enabled", "false")
    spark.range(1).join(spark.range(1), how="inner").show()
    ```
    
    ```
    Traceback (most recent call last):
    ...
    py4j.protocol.Py4JJavaError: An error occurred while calling o66.join.
    : java.lang.NullPointerException
        at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ...
    ```
    
    ```python
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    spark.range(1).join(spark.range(1), how="inner").show()
    ```
    
    ```
    ...
    py4j.protocol.Py4JJavaError: An error occurred while calling o84.join.
    : java.lang.NullPointerException
        at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ...
    ```
    
    This PR proposes following the Scala behavior, shown below:
    
    ```scala
    scala> spark.conf.set("spark.sql.crossJoin.enabled", "false")
    
    scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
    ```
    
    ```
    org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
    Range (0, 1, step=1, splits=Some(8))
    and
    Range (0, 1, step=1, splits=Some(8))
    Join condition is missing or trivial.
    Use the CROSS JOIN syntax to allow cartesian products between these relations.;
    ...
    ```
    
    ```scala
    scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")
    
    scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
    ```
    ```
    +---+---+
    | id| id|
    +---+---+
    |  0|  0|
    +---+---+
    ```
    
    
    **After**
    
    
    ```python
    spark.conf.set("spark.sql.crossJoin.enabled", "false")
    spark.range(1).join(spark.range(1), how="inner").show()
    ```
    
    ```
    Traceback (most recent call last):
    ...
    pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans\nRange (0, 1, step=1, splits=Some(8))\nand\nRange (0, 1, step=1, splits=Some(8))\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;'
    ```
    
    ```python
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    spark.range(1).join(spark.range(1), how="inner").show()
    ```
    ```
    +---+---+
    | id| id|
    +---+---+
    |  0|  0|
    +---+---+
    ```
    
    ## How was this patch tested?
    
    Added tests in `python/pyspark/sql/tests.py`.
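
The argument handling described above can be modeled without a running Spark session. The following is only a sketch of the idea, not the code in this patch: `RecordingJavaDataFrame` and `dispatch_join` are hypothetical names, the stand-in class merely records what would be passed to the JVM, and the handling of `on` is simplified to plain strings. The point it illustrates is that a missing `on` is turned into an empty column list (mirroring `Seq.empty[String]` in the Scala example) instead of being forwarded as `None`.

```python
class RecordingJavaDataFrame:
    """Hypothetical stand-in for the Py4J-backed Java Dataset; it only records calls."""

    def __init__(self):
        self.calls = []

    def join(self, *args):
        # The real JVM method would build a join plan; here the arguments are just logged.
        self.calls.append(args)
        return self


def dispatch_join(jdf, other_jdf, on=None, how=None):
    """Simplified model: never forward None for `on` when a join type is involved."""
    if on is None and how is None:
        # Neither columns nor join type given: use the plain join(other) overload.
        return jdf.join(other_jdf)
    if how is None:
        how = "inner"
    if on is None:
        # Assumption: missing columns become an empty list, like Seq.empty[String].
        on = []
    elif isinstance(on, str):
        on = [on]
    if not isinstance(how, str):
        raise TypeError("how should be a string")
    return jdf.join(other_jdf, on, how)


left, right = RecordingJavaDataFrame(), RecordingJavaDataFrame()
dispatch_join(left, right, how="inner")
print(left.calls[-1][1:])  # ([], 'inner')
```

With an empty column list, the decision to reject or allow the cartesian product is left to the analyzer and `spark.sql.crossJoin.enabled`, as in the Scala output shown earlier.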


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21264

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18484.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18484
    
----
commit 6ec509b990c985ad8519aa563cfe08c24e6847ae
Author: hyukjinkwon <[email protected]>
Date:   2017-06-30T08:11:01Z

    Call cross join path in PySpark join rather than throwing NPE

----

