Hi James,

I see how the behavior makes sense now, but I was wondering why a user would do this intentionally instead of using head() or first(). I thought it would mostly happen by mistake, since there is no benefit to using df.dropDuplicates(Seq.empty).
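For illustration, here is a minimal sketch of the two behaviors being discussed (assuming a local SparkSession; the example data and app name are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dropDuplicates-demo")
  .getOrCreate()
import spark.implicits._

// A tiny DataFrame with one duplicated row.
val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")

// No subset: duplicates are detected across all columns, so two rows remain.
df.dropDuplicates().show()

// Empty subset: every pair of rows trivially matches on the empty set of
// columns, so only the first row is kept (the behavior discussed in this thread).
df.dropDuplicates(Seq.empty[String]).show()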
On Fri, May 9, 2025 at 8:50 PM James Willis <ja...@wherobots.com> wrote:

> This seems like the correct behavior to me. Every value of the null set
> of columns will match between any pair of Rows.
>
> On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
> wrote:
>
>> Hello everyone,
>>
>> Following the creation of this PR
>> <https://github.com/apache/spark/pull/50714> and the discussion in the
>> thread, what do you think about the behavior described here:
>>
>>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>>> subset argument, the resulting DataFrame contains a single row (the
>>> first row). This behavior is different than using
>>> DataFrame.dropDuplicates without any parameters or with None as the
>>> subset argument. I would expect that passing an empty array to
>>> dropDuplicates would use all the columns to detect duplicates and
>>> remove them.
>>
>> This behavior is the same on the Scala side, where
>> df.dropDuplicates(Seq.empty) returns the first row.
>>
>> Would it make sense to change the behavior of
>> df.dropDuplicates(Seq.empty) to be the same as df.dropDuplicates()?
>>
>> Cheers,
>>
>> David