Hello everyone,

Following the creation of PR <https://github.com/apache/spark/pull/50714> and the discussion in its thread, what do you think about the behavior described below?
> When using PySpark DataFrame.dropDuplicates with an empty array as the
> subset argument, the resulting DataFrame contains a single row (the
> first row). This behavior differs from calling DataFrame.dropDuplicates
> without any parameters or with None as the subset argument. I would
> expect that passing an empty array to dropDuplicates would use all the
> columns to detect duplicates and remove them.

The behavior is the same on the Scala side, where df.dropDuplicates(Seq.empty) also returns only the first row.

Would it make sense to change the behavior of df.dropDuplicates(Seq.empty) to match df.dropDuplicates()?

Cheers,
David
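
P.S. For reference, a minimal sketch of the discrepancy in PySpark (the data and column names here are just for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data: two identical rows plus one distinct row.
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")],
        ["id", "value"],
    )

    df.dropDuplicates().show()    # 2 rows: deduplicates over all columns
    df.dropDuplicates([]).show()  # 1 row: only the first row is kept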