David Kunzmann created SPARK-51710:
--------------------------------------
Summary: Using DataFrame.dropDuplicates with an empty array as
argument behaves unexpectedly
Key: SPARK-51710
URL: https://issues.apache.org/jira/browse/SPARK-51710
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.5.5
Reporter: David Kunzmann
When PySpark's DataFrame.dropDuplicates is called with an empty array as the
subset argument, the resulting DataFrame contains a single row (the first row).
This differs from calling DataFrame.dropDuplicates without any parameters or
with None as the subset argument.
{code:java}
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Alice"),
    (3, "Alice"),
    (2, "Bob")
]
df = spark.createDataFrame(data, ["id", "name"])
df_dedup = df.dropDuplicates([])
df_dedup.show()
{code}
The above snippet will show the following DataFrame:
{code:java}
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
+---+-----+ {code}
I would expect the behavior to be the same as df.dropDuplicates(), which returns:
{code:java}
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Alice|
+---+-----+ {code}
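Until this is resolved, one possible workaround is to normalize an empty subset to None before calling dropDuplicates, since passing None behaves like passing no argument as described above. The following is only a minimal sketch: drop_duplicates_safe is a hypothetical helper name and is not part of PySpark, and it assumes the caller wants an empty subset to mean "deduplicate on all columns".

```python
from typing import Optional, Sequence


def drop_duplicates_safe(df, subset: Optional[Sequence[str]] = None):
    """Call df.dropDuplicates, treating an empty subset like no subset.

    Hypothetical helper (not a PySpark API). `df` is expected to be a
    pyspark.sql.DataFrame, but any object exposing a compatible
    dropDuplicates(subset) method works.
    """
    # An empty list/tuple is replaced by None so that the call takes the
    # same path as df.dropDuplicates() with no arguments.
    if subset is not None and len(subset) == 0:
        subset = None
    return df.dropDuplicates(subset)
```

With the DataFrame from the snippet above, drop_duplicates_safe(df, []) would then be equivalent to df.dropDuplicates(), while a non-empty subset such as ["name"] is passed through unchanged.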
--
This message was sent by Atlassian Jira
(v8.20.10#820010)