Hello everyone,

Following the creation of PR <https://github.com/apache/spark/pull/50714> and the discussion in its thread, what do you think about the behavior described below?
> When using PySpark DataFrame.dropDuplicates with an empty array as the
> subset argument, the resulting DataFrame contains a single row (the
> first row). This behavior differs from calling DataFrame.dropDuplicates
> without any parameters or with None as the subset argument. I would
> expect that passing an empty array to dropDuplicates would use all the
> columns to detect duplicates and remove them.

The behavior is the same on the Scala side, where df.dropDuplicates(Seq.empty) also returns only the first row.

Would it make sense to change the behavior of df.dropDuplicates(Seq.empty) to match df.dropDuplicates()?

Cheers,
David
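
P.S. For reference, a minimal sketch of the discrepancy in PySpark (the data and column names here are just for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data: two identical rows plus one distinct row.
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")],
        ["id", "value"],
    )

    df.dropDuplicates().show()    # 2 rows: deduplicates over all columns
    df.dropDuplicates([]).show()  # 1 row: only the first row is kept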