[
https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083659#comment-17083659
]
Sunitha Kambhampati commented on SPARK-31256:
---------------------------------------------
I can repro the issue using the Scala API on trunk.
It looks like SPARK-30065 explicitly removed support for nested column
resolution in drop. The change went into trunk as well as the 2.4.5
branch.
For example, see the test:
https://github.com/apache/spark/blame/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala#L302
cc [~cloud_fan], [~imback82] This seems to be a regression. Is there a
reason to remove this behavior?
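For reference, here is a minimal plain-Python sketch (not Spark code) of the
behavior the reporter observed on 2.4.4 and 2.3.1: a dotted name in the subset
is resolved into the struct, and a row is dropped when that nested field is
null. The helper names `resolve` and `dropna_nested` are hypothetical and only
model the semantics; they are not Spark APIs, and rows are modeled as plain
dicts as an assumption for illustration.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def resolve(row: Row, path: str) -> Any:
    """Follow a dotted path such as 'struct_col.name' into nested dicts.

    Returns None if any step along the path is missing or null.
    """
    value: Any = row
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def dropna_nested(rows: Sequence[Row], subset: Sequence[str]) -> List[Row]:
    """Keep rows in which every path in `subset` resolves to a non-null value.

    This models dropna(how='any', subset=...) with nested-field resolution.
    """
    return [
        row for row in rows
        if all(resolve(row, path) is not None for path in subset)
    ]


# Same data as in the report, with the struct modeled as a nested dict.
rows: List[Row] = [
    {"age": 5, "height": 80, "name": "Alice"},
    {"age": 10, "height": None, "name": "Bob"},
    {"age": 15, "height": 80, "name": None},
]
for r in rows:
    r["struct_col"] = {"age": r["age"], "height": r["height"], "name": r["name"]}

kept = dropna_nested(rows, ["struct_col.name"])
print([r["name"] for r in kept])  # prints ['Alice', 'Bob']
```

Under this model only the row whose struct_col.name is null is dropped, which
matches the 2.4.4 output quoted below, rather than the empty result seen on
2.4.5.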
> Dropna doesn't work for struct columns
> --------------------------------------
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
> Reporter: Michael Souder
> Priority: Major
>
> Calling dropna with a subset that references a field inside a struct column
> drops every row of the data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame(
>     [(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)],
>     schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn(
>     'struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with Python 3.7.4 and in Spark 2.3.1
> with Python 3.6.8; in both, the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}