Marco De Nadai created SPARK-33221:
--------------------------------------
Summary: UDF inside when() is applied to all rows regardless of
the condition
Key: SPARK-33221
URL: https://issues.apache.org/jira/browse/SPARK-33221
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.1, 2.4.5
Reporter: Marco De Nadai
Hi all,
I think there is a bug or, at least, an undocumented behaviour of PySpark UDFs.
The code below tries to apply the UDF only to a subset of rows (convenient for
long dataframes).
{code:python}
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=BooleanType())
def test_udf(x):
    # fail loudly if the UDF ever receives a null value
    if x is None:
        raise Exception(x)
    return True

data = [(1, 11, 1), (1, 22, 2), (1, 33, 3), (2, 44, 1), (3, 55, 1), (4, 66, 1)]
dataColumns = ["uid", "price", "day"]
test = spark.createDataFrame(data=data, schema=dataColumns)

# lag_price is null for the last row of each uid partition
w = Window.partitionBy('uid').orderBy('uid', 'day')
test = test.withColumn('lag_price', F.lead(F.col('price')).over(w))
print(test.dtypes)

test = test.withColumn('condition', F.col('lag_price').isNotNull())
# the UDF should only run where 'condition' is true, but it is
# also invoked on the rows where lag_price is null
test.withColumn('appliedUDF',
                F.when(F.col('condition'), test_udf(F.col('lag_price')))
                 .otherwise(False)).show()
{code}
It throws this error:
{code:java}
File "<command-3513778682084612>", line 4, in test_udf
Exception: None
{code}
Is this expected behaviour? Am I missing something?
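For what it's worth, the only way I found to avoid the exception is to make the
UDF itself tolerate null input instead of relying on when() to skip those rows.
A minimal sketch reusing the dataframe above (test_udf_null_safe is just an
illustrative name, not part of the original code):
{code:python}
# Workaround sketch: handle nulls inside the UDF, since the UDF may
# still be evaluated on rows where the when() condition is false.
@udf(returnType=BooleanType())
def test_udf_null_safe(x):
    if x is None:
        return False   # tolerate nulls instead of raising
    return True

test.withColumn('appliedUDF',
                F.when(F.col('condition'), test_udf_null_safe(F.col('lag_price')))
                 .otherwise(False)).show()
{code}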