Tim Sell created SPARK-17100:
--------------------------------
Summary: pyspark filter on a udf column after join gives
java.lang.UnsupportedOperationException
Key: SPARK-17100
URL: https://issues.apache.org/jira/browse/SPARK-17100
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
Reporter: Tim Sell
In pyspark, when filtering on a udf derived column after some join types,
the optimized logical plan results is a java.lang.UnsupportedOperationException.
I could not replicate this in scala code from the shell, just python. It is a
pyspark regression from spark 1.6.2.
This can be replicated with: bin/spark-submit bug.py
{code:python:title=bug.py}
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession
if __name__ == '__main__':
spark = SparkSession.builder.appName("test").getOrCreate()
left = spark.createDataFrame([Row(a=1)])
right = spark.createDataFrame([Row(a=1)])
df = left.join(right, on='a', how='left_outer')
df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
df = df.filter('b = "x"')
df.explain(extended=True)
{code}
The output is:
{code}
== Parsed Logical Plan ==
'Filter ('b = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
+- Project [a#0L]
+- Join LeftOuter, (a#0L = a#3L)
:- LogicalRDD [a#0L]
+- LogicalRDD [a#3L]
== Analyzed Logical Plan ==
a: bigint, b: string
Filter (b#8 = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
+- Project [a#0L]
+- Join LeftOuter, (a#0L = a#3L)
:- LogicalRDD [a#0L]
+- LogicalRDD [a#3L]
== Optimized Logical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression:
<lambda>(input[0, bigint, true])
== Physical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression:
<lambda>(input[0, bigint, true])
{code}
It fails when the join is:
* how='outer', on=column expression
* how='left_outer', on=string or column expression
* how='right_outer', on=string or column expression
It passes when the join is:
* how='inner', on=string or column expression
* how='outer', on=string
I made some tests to demonstrate each of these.
Run with bin/spark-submit test_bug.py
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]