[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

Stu (Michael Stewart) (JIRA) Tue, 02 Oct 2018 09:56:04 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635811#comment-16635811
 ]


Stu (Michael Stewart) commented on SPARK-25461:
-----------------------------------------------

Thanks all; to me the largest issue with this behavior is the silent failure — 
there is a relatively sane workaround to the issue but the silent failure is 
deeply unnerving. Especially as the same code runs in pandas proper with no 
hint of the issue. Even raising some runtime error would be a huge win from my 
perspective! 

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>     .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> | A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

Reply via email to