[ https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083666#comment-17083666 ]
Terry Kim commented on SPARK-31256:
-----------------------------------

Let me look into this.

> Dropna doesn't work for struct columns
> --------------------------------------
>
>                 Key: SPARK-31256
>                 URL: https://issues.apache.org/jira/browse/SPARK-31256
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Spark 2.4.5
>                      Python 3.7.4
>            Reporter: Michael Souder
>            Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)], schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with python 3.7.4 and Spark 2.3.1
> with python 3.6.8 and in both, the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)