[jira] [Commented] (SPARK-31256) Dropna doesn't work for struct columns

Sunitha Kambhampati (Jira) Tue, 14 Apr 2020 15:36:27 -0700


    [ https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083659#comment-17083659 ]


Sunitha Kambhampati commented on SPARK-31256:
---------------------------------------------

I can reproduce the issue using the Scala API on trunk.

It looks like SPARK-30065 explicitly removed support for nested column
resolution in drop. The change went into trunk as well as the 2.4.5
branch.

See, for example, this test:

https://github.com/apache/spark/blame/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala#L302

cc [~cloud_fan], [~imback82]: This seems to be a regression. Was there a
reason for removing this behavior?
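
Until this is settled, one possible workaround is to filter on the nested field directly instead of going through dropna. The following is only a sketch: the helper names not_null_condition and dropna_nested are hypothetical (they are not part of Spark), and it assumes the entries in subset are plain identifiers or dotted struct paths that need no backtick quoting. It relies on DataFrame.filter accepting a SQL condition string.

```python
def not_null_condition(subset):
    """Build a SQL predicate that is true only when every listed column
    (or dotted struct field path) is non-null.

    Assumption: names are plain identifiers or dotted paths such as
    'struct_col.name'. A top-level column whose name itself contains a
    dot would need backtick quoting, which this sketch does not handle.
    """
    if not subset:
        raise ValueError("subset must name at least one column")
    return " AND ".join(f"{name} IS NOT NULL" for name in subset)


def dropna_nested(df, subset):
    """Mimic dropna(how='any', subset=...) for nested fields by keeping
    only the rows where all listed columns/fields are non-null."""
    # DataFrame.filter accepts a SQL expression string as its condition.
    return df.filter(not_null_condition(subset))
```

With the reporter's example, dropna_nested(df_with_struct, ['struct_col.name']) should keep the Alice and Bob rows, i.e. the same result that 2.4.4 is reported to give in the description below.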

 

> Dropna doesn't work for struct columns
> --------------------------------------
>
>                 Key: SPARK-31256
>                 URL: https://issues.apache.org/jira/browse/SPARK-31256
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Spark 2.4.5
> Python 3.7.4
>            Reporter: Michael Souder
>            Priority: Major
>
> Calling dropna with a subset that names a field inside a struct column drops every row of the data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame(
>     [(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)],
>     schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn(
>     'struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with Python 3.7.4 and in Spark 2.3.1
> with Python 3.6.8; in both, the result is:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
