So you are basically saying df.dropDuplicates(Seq.empty) should be the same
as df.dropDuplicates(all_columns). I think this is a reasonable change, as
the previous behavior, which always returns the first row, doesn't make
sense. For safety, we can add a legacy config as a fallback and mention it
in the migration guide.
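
To make it concrete, a minimal Scala sketch of the proposed equivalence
(the config name at the end is hypothetical, only to illustrate the
fallback idea):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")

  // Today: an empty subset keeps only the first row.
  df.dropDuplicates(Seq.empty).count()         // 1

  // Proposed: an empty subset behaves like all columns, i.e. the same
  // as calling dropDuplicates() with no arguments.
  df.dropDuplicates(df.columns.toSeq).count()  // 2
  df.dropDuplicates().count()                  // 2

  // Hypothetical legacy flag to restore the old behavior:
  // spark.conf.set("spark.sql.legacy.dropDuplicatesEmptySubset", "true")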

On Wed, May 14, 2025 at 9:21 AM David Kunzmann <davidkunzm...@gmail.com>
wrote:

> Hi James,
> I see how the behavior makes sense now, but I was wondering why a user
> would do this intentionally instead of using head() or first().
> I thought it could mainly be done by mistake, as there is no benefit to
> using df.dropDuplicates(Seq.empty).
>
> On Fri, May 9, 2025 at 8:50 PM James Willis <ja...@wherobots.com> wrote:
>
>> This seems like the correct behavior to me. Every column in an empty
>> subset (vacuously) matches between any pair of Rows.
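>>
>> For example, in Scala (rowA and rowB here are just two arbitrary Rows):
>>
>>   import org.apache.spark.sql.Row
>>
>>   val rowA = Row(1, "a")
>>   val rowB = Row(2, "b")
>>   // forall over an empty Seq is vacuously true, so any two rows
>>   // "agree" on every column of an empty subset:
>>   Seq.empty[Int].forall(i => rowA.get(i) == rowB.get(i))   // => true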
>>
>>
>>
>> On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> Following the creation of this PR
>>> <https://github.com/apache/spark/pull/50714> and the discussion in its
>>> thread, what do you think about the behavior described here:
>>>
>>>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>>>> subset argument, the resulting DataFrame contains a single row (the
>>>> first row). This behavior differs from using DataFrame.dropDuplicates
>>>> without any parameters or with None as the subset argument. I would
>>>> expect that passing an empty array to dropDuplicates would use all the
>>>> columns to detect duplicates and remove them.
>>>>
>>>
>>>
>>> This behavior is the same on the Scala side, where
>>> df.dropDuplicates(Seq.empty) returns the first row.
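>>>
>>> A quick sketch (assuming an active SparkSession named spark):
>>>
>>>   import spark.implicits._
>>>
>>>   val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")
>>>   df.dropDuplicates().count()           // 2: no subset means all columns
>>>   df.dropDuplicates(Seq.empty).count()  // 1: only the first row remains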
>>>
>>> Would it make sense to change the behavior of
>>> df.dropDuplicates(Seq.empty) to be the same as df.dropDuplicates()?
>>>
>>> Cheers,
>>>
>>> David
>>>
>>>
