Chongyuan Xiang created SPARK-25461:
---------------------------------------
Summary: PySpark Pandas UDF outputs incorrect results when input
columns contain None
Key: SPARK-25461
URL: https://issues.apache.org/jira/browse/SPARK-25461
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.1
Environment: I reproduced this issue by running pyspark locally on mac:
Spark version: 2.3.1 pre-built with Hadoop 2.7
Python library versions: pyarrow==0.10.0, pandas==0.20.2
Reporter: Chongyuan Xiang
The following PySpark script uses a simple pandas UDF to calculate a column
given column 'A'. When column 'A' contains None, the results look incorrect.
Script:
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf
values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)
@pandas_udf(returnType=pyspark.sql.types.BooleanType())
def gt_2(column):
return (column >= 2).where(column.notnull())
calculated_df = (df.select(['A'])
.withColumn('potential_bad_col', gt_2('A'))
)
calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) |
(col("A").isNull()))
calculated_df.show()
{code}
Output:
{code:java}
+---+-----------------+-----------+
| A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|1.0| false| false|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
|2.0| false| true|
+---+-----------------+-----------+
only showing top 20 rows
{code}
This problem disappears when the number of rows is small or when the input
column does not contain None.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]