Hi James,

I see how the behavior makes sense now, but I was wondering why a user would do this intentionally instead of using head() or first(). I thought it would mostly happen by mistake, since there is no benefit to using df.dropDuplicates(Seq.empty).
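For illustration, here is a minimal sketch of the two behaviors being discussed (assuming a local SparkSession; the example data and app name are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dropDuplicates-demo")
  .getOrCreate()
import spark.implicits._

// A tiny DataFrame with one duplicated row.
val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")

// No subset: duplicates are detected across all columns, so two rows remain.
df.dropDuplicates().show()

// Empty subset: every pair of rows trivially matches on the empty set of
// columns, so only the first row is kept (the behavior discussed in this thread).
df.dropDuplicates(Seq.empty[String]).show()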
On Fri, May 9, 2025 at 8:50 PM James Willis <ja...@wherobots.com> wrote:

> This seems like the correct behavior to me. Every value of the null set
> of columns will match between any pair of Rows.
>
> On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
> wrote:
>
>> Hello everyone,
>>
>> Following the creation of this PR
>> <https://github.com/apache/spark/pull/50714> and the discussion in the
>> thread, what do you think about the behavior described here:
>>
>>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>>> subset argument, the resulting DataFrame contains a single row (the
>>> first row). This behavior is different than using
>>> DataFrame.dropDuplicates without any parameters or with None as the
>>> subset argument. I would expect that passing an empty array to
>>> dropDuplicates would use all the columns to detect duplicates and
>>> remove them.
>>
>> This behavior is the same on the Scala side, where
>> df.dropDuplicates(Seq.empty) returns the first row.
>>
>> Would it make sense to change the behavior of
>> df.dropDuplicates(Seq.empty) to be the same as df.dropDuplicates()?
>>
>> Cheers,
>>
>> David