[ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17100.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.0.1

Issue resolved by pull request 15103
[https://github.com/apache/spark/pull/15103]

> pyspark filter on a udf column after join gives 
> java.lang.UnsupportedOperationException
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-17100
>                 URL: https://issues.apache.org/jira/browse/SPARK-17100
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>         Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
>            Reporter: Tim Sell
>            Assignee: Davies Liu
>             Fix For: 2.0.1, 2.2.0
>
>         Attachments: bug.py, test_bug.py
>
>
> In pyspark, when filtering on a udf derived column after some join types,
> the optimized logical plan results is a 
> java.lang.UnsupportedOperationException.
> I could not replicate this in scala code from the shell, just python. It is a 
> pyspark regression from spark 1.6.2.
> This can be replicated with: bin/spark-submit bug.py
> {code:python:title=bug.py}
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> if __name__ == '__main__':
>     spark = SparkSession.builder.appName("test").getOrCreate()
>     left = spark.createDataFrame([Row(a=1)])
>     right = spark.createDataFrame([Row(a=1)])
>     df = left.join(right, on='a', how='left_outer')
>     df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
>     df = df.filter('b = "x"')
>     df.explain(extended=True)
> {code}
> The output is:
> {code}
> == Parsed Logical Plan ==
> 'Filter ('b = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>    +- Project [a#0L]
>       +- Join LeftOuter, (a#0L = a#3L)
>          :- LogicalRDD [a#0L]
>          +- LogicalRDD [a#3L]
> == Analyzed Logical Plan ==
> a: bigint, b: string
> Filter (b#8 = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>    +- Project [a#0L]
>       +- Join LeftOuter, (a#0L = a#3L)
>          :- LogicalRDD [a#0L]
>          +- LogicalRDD [a#3L]
> == Optimized Logical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> == Physical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> {code}
> It fails when the join is:
> * how='outer', on=column expression
> * how='left_outer', on=string or column expression
> * how='right_outer', on=string or column expression
> It passes when the join is:
> * how='inner', on=string or column expression
> * how='outer', on=string
> I made some tests to demonstrate each of these.
> Run with bin/spark-submit test_bug.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to