[ 
https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083666#comment-17083666
 ] 

Terry Kim commented on SPARK-31256:
-----------------------------------

Let me look into this.

> Dropna doesn't work for struct columns
> --------------------------------------
>
>                 Key: SPARK-31256
>                 URL: https://issues.apache.org/jira/browse/SPARK-31256
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Spark 2.4.5
> Python 3.7.4
>            Reporter: Michael Souder
>            Priority: Major
>
> Calling dropna with a subset that names a field inside a struct column drops every row of the DataFrame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)], schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with Python 3.7.4 and in Spark 2.3.1 with Python 3.6.8. In both versions the same call returns only the rows with a non-null name:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}
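
For reference, the result the report expects from dropna(subset=['struct_col.name']) can be modelled in plain Python. This is only a sketch of the intended semantics, not Spark's implementation: rows are represented as dicts, the struct as a nested dict, and the helper names resolve and dropna_model are hypothetical.

```python
# Plain-Python model of the expected dropna(subset=[...]) semantics for a
# dotted path into a struct column. Not Spark code; names are illustrative.

def resolve(row, path):
    """Follow a dotted path such as 'struct_col.name'.

    Returns None as soon as any step along the path is null.
    """
    value = row
    for part in path.split("."):
        if value is None:
            return None
        value = value[part]
    return value


def dropna_model(rows, subset):
    """Keep rows in which every path in `subset` resolves to a non-null value."""
    return [
        row for row in rows
        if all(resolve(row, path) is not None for path in subset)
    ]


# Same data as in the report; struct_col mirrors the three top-level columns.
data = [(5, 80, "Alice"), (10, None, "Bob"), (15, 80, None)]
rows = [
    {"age": a, "height": h, "name": n,
     "struct_col": {"age": a, "height": h, "name": n}}
    for a, h, n in data
]

kept = dropna_model(rows, ["struct_col.name"])
print([row["name"] for row in kept])  # ['Alice', 'Bob']
```

Under this model only the row whose struct_col.name is null is removed, which matches the output reported for Spark 2.4.4 and 2.3.1 above.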


