So you are basically saying df.dropDuplicates(Seq.empty) should behave the same as df.dropDuplicates(all_columns). I think this is a reasonable change, since the previous behavior, which always returns just the first row, doesn't make sense. For safety, we can add a legacy config as a fallback and mention it in the migration guide.
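For reference, a small sketch of the difference under discussion (Scala; assumes a local SparkSession, and the data and column names are just illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dropDuplicates-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("k", "v")

df.dropDuplicates().count()          // 2: deduplicates over all columns
df.dropDuplicates(Seq.empty).count() // 1 today: with zero grouping columns
                                     // every row "matches", so only the
                                     // first row survives

// Proposed: make the second call behave like the first, with a legacy
// config to restore the old single-row behavior.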
On Wed, May 14, 2025 at 9:21 AM David Kunzmann <davidkunzm...@gmail.com> wrote:

> Hi James,
> I see how the behavior makes sense now, but I was wondering why a user
> would do this intentionally instead of using head() or first().
> I thought it could mainly be done by mistake, as there is no benefit to
> using df.dropDuplicates(Seq.empty).
>
> On Fri, May 9, 2025 at 8:50 PM James Willis <ja...@wherobots.com> wrote:
>
>> This seems like the correct behavior to me. Every value of the null set
>> of columns will match between any pair of Rows.
>>
>> On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> Following the creation of this PR
>>> <https://github.com/apache/spark/pull/50714> and the discussion in that
>>> thread, what do you think about the behavior described here:
>>>
>>>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>>>> subset argument, the resulting DataFrame contains a single row (the
>>>> first row). This behavior is different from using
>>>> DataFrame.dropDuplicates without any parameters or with None as the
>>>> subset argument. I would expect that passing an empty array to
>>>> dropDuplicates would use all the columns to detect duplicates and
>>>> remove them.
>>>
>>> This behavior is the same on the Scala side, where
>>> df.dropDuplicates(Seq.empty) returns the first row.
>>>
>>> Would it make sense to change the behavior of
>>> df.dropDuplicates(Seq.empty) to be the same as df.dropDuplicates()?
>>>
>>> Cheers,
>>>
>>> David