Franklyn Dsouza created SPARK-19844:
---------------------------------------
Summary: UDF in when control function is executed before the when clause is evaluated
Key: SPARK-19844
URL: https://issues.apache.org/jira/browse/SPARK-19844
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.1.0, 2.0.1
Reporter: Franklyn Dsouza

Sometimes we try to filter the argument to a UDF using
{code}when(clause, udf).otherwise(default){code}
but we've noticed that the UDF is sometimes run on data that shouldn't have matched the clause.

Here's some code to reproduce the issue:

{code}
from pyspark.sql import functions as F
from pyspark.sql import types

# `spark` is assumed to be an active SparkSession
df = spark.createDataFrame(
    [{'r': None}],
    schema=types.StructType([types.StructField('r', types.StringType())]))

# Raises AttributeError if ref is None
simple_udf = F.udf(lambda ref: ref.strip("/"), types.StringType())

df.withColumn(
    'test',
    F.when(F.col("r").isNotNull(), simple_udf(F.col("r")))
     .otherwise(F.lit(None))
).collect()
{code}

This raises an exception because the UDF is run on null data. I get AttributeError: 'NoneType' object has no attribute 'strip', so it looks like the UDF is being evaluated before the clause in the when is evaluated.

Oddly enough, when I change {code}F.col("r").isNotNull(){code} to {code}df["r"] != None{code} it works.

Might be related to https://issues.apache.org/jira/browse/SPARK-13773 and https://issues.apache.org/jira/browse/SPARK-15282
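
A possible workaround, independent of how Spark orders the evaluation, is to make the Python function itself tolerate None so that it no longer matters whether the UDF runs before the when clause. This is only a minimal sketch: `null_safe` and `strip_slashes` are hypothetical names introduced for illustration and are not part of PySpark.

```python
import functools


def null_safe(fn):
    """Wrap fn so that a None input yields None instead of raising."""
    @functools.wraps(fn)
    def wrapper(value):
        # Short-circuit on None so fn never sees a null value.
        return None if value is None else fn(value)
    return wrapper


# Same logic as the lambda in the report, but guarded against None.
strip_slashes = null_safe(lambda ref: ref.strip("/"))

# Registering it with Spark would then look like (not executed here):
#   simple_udf = F.udf(strip_slashes, types.StringType())
```

With a guard like this the when/otherwise wrapper is no longer needed for correctness, since a null row simply produces a null result.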