Marco De Nadai created SPARK-33221:
--------------------------------------

             Summary: UDF in when operation applied in all the rows regardless 
the condition
                 Key: SPARK-33221
                 URL: https://issues.apache.org/jira/browse/SPARK-33221
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.1, 2.4.5
            Reporter: Marco De Nadai


Hi all,

I think there is a bug or, at least, an undocumented behaviour in PySpark UDFs. 
The code below tries to apply the UDF to only a subset of rows (convenient 
for long DataFrames), but the UDF is also called on rows where the condition is false.

 
{code:java}
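from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Raises on None so that any call with a null input becomes visible
@udf(returnType=BooleanType())
def test_udf(x):
  if x is None:
    raise Exception(x)
  return True

data = [(1,11,1),(1,22,2),(1,33,3),(2,44,1),(3,55,1),(4,66,1)]
dataColumns = ["uid","price","day"]
test = spark.createDataFrame(data=data, schema=dataColumns)
w = Window.partitionBy('uid').orderBy('uid','day')
# F.lead takes the next row's price; it is null on the last row of each uid
test = test.withColumn('lag_price', F.lead(F.col('price')).over(w))
print(test.dtypes)
test = test.withColumn('condition', F.col('lag_price').isNotNull())
# Intended: call test_udf only where 'condition' is true
test.withColumn('appliedUDF', F.when(F.col('condition'),
  test_udf(F.col('lag_price'))).otherwise(False)).show()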
{code}
It throws this error:
{code:java}
File "<command-3513778682084612>", line 4, in test_udf
Exception: None
{code}
Is this expected behaviour? Am I missing something?
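
A possible workaround, sketched under the assumption that Spark gives no guarantee about the evaluation order of subexpressions, so a Python UDF inside F.when may be evaluated for every row: move the null check into the function itself. The names null_safe, check_price and add_applied_udf are hypothetical helpers introduced for illustration, not part of PySpark.

```python
from functools import wraps


def null_safe(default=None):
    """Decorator factory: return `default` without calling the wrapped
    function when any positional argument is None."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args):
            if any(a is None for a in args):
                return default
            return fn(*args)
        return wrapper
    return decorator


@null_safe(default=False)
def check_price(x):
    # Same body as test_udf in the report: it still raises on None,
    # but the guard above stops None from ever reaching it.
    if x is None:
        raise Exception(x)
    return True


def add_applied_udf(df):
    """Add the 'appliedUDF' column to a DataFrame that already has the
    'condition' and 'lag_price' columns from the report.
    pyspark is imported lazily so the helpers above stay plain Python."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    safe_udf = F.udf(check_price, BooleanType())
    return df.withColumn(
        'appliedUDF',
        F.when(F.col('condition'), safe_udf(F.col('lag_price'))).otherwise(False),
    )
```

With this guard the result no longer depends on whether the UDF is evaluated before or after the condition: add_applied_udf(test).show() should complete even if the UDF is called on the null rows.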

 



