[jira] [Comment Edited] (SPARK-51262) exceptAll not working with drop_duplicates using subset

Shrirang Mhalgi (Jira) Fri, 15 May 2026 09:48:08 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-51262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081235#comment-18081235
 ]


Shrirang Mhalgi edited comment on SPARK-51262 at 5/15/26 4:47 PM:
------------------------------------------------------------------

I was able to reproduce this on the master branch. The issue occurs when 
{{exceptAll}} (which relies on the {{RewriteExceptAll}} optimizer rule) runs on a 
DataFrame produced by {{dropDuplicates(subset)}}. The root cause is an 
attribute reference mismatch in the optimized plan. 

 

Working on a fix.

 

Repro (Scala):

> val df1 = spark.createDataFrame(Seq((1, "a", 100), (1, "a", 200), (2, "b", 300)))
>   .toDF("id", "name", "value")
> val df2 = spark.createDataFrame(Seq((1, "a", 100))).toDF("id", "name", "value")
> df1.dropDuplicates("id", "name").exceptAll(df2).count()
> // Throws: INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND
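
For reference, the result the repro is presumably meant to produce (instead of the internal error) can be sketched with plain Scala collections, without Spark. This is only an illustration of the intended semantics, not the optimizer's implementation: the {{Row}} case class and the {{ExceptAllSketch}} object are hypothetical names, {{Seq.diff}} stands in for {{exceptAll}} (a multiset difference), and the sketch assumes the first row per (id, name) key survives deduplication, which Spark does not guarantee.

```scala
// Plain Scala stand-in for the repro; no Spark dependency.
object ExceptAllSketch {
  // Hypothetical row type mirroring the columns id, name, value.
  final case class Row(id: Int, name: String, value: Int)

  val rows1: Seq[Row] = Seq(Row(1, "a", 100), Row(1, "a", 200), Row(2, "b", 300))
  val rows2: Seq[Row] = Seq(Row(1, "a", 100))

  // Stand-in for dropDuplicates("id", "name"): keep one row per (id, name) key.
  // Assumption: the first row per key is kept; Spark may keep a different one.
  val deduped: Seq[Row] =
    rows1.groupBy(r => (r.id, r.name)).values.map(_.head).toSeq.sortBy(_.id)

  // Stand-in for exceptAll: Seq.diff removes one occurrence from the left
  // for each equal occurrence on the right (multiset difference).
  val result: Seq[Row] = deduped.diff(rows2)

  def main(args: Array[String]): Unit =
    println(s"rows after exceptAll: ${result.size}")
}
```

Under that assumption one row, (2, "b", 300), remains. If deduplication instead kept (1, "a", 200), nothing in the second input would match and two rows would remain, so the count in the repro can legitimately be 1 or 2.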


was (Author: JIRAUSER313104):
I was able to reproduce this on master (branch-4.0). The issue occurs when 
{{exceptAll}} (which uses {{RewriteExceptAll}} optimizer rule) runs on a 
DataFrame produced by {{dropDuplicates(subset)}}. The root cause is an 
attribute reference mismatch in the optimized plan. 

 

Working on a fix.

 

Repro (Scala):

> val df1 = spark.createDataFrame(Seq((1, "a", 100), (1, "a", 200), (2, "b", 300)))
>   .toDF("id", "name", "value")
> val df2 = spark.createDataFrame(Seq((1, "a", 100))).toDF("id", "name", "value")
> df1.dropDuplicates("id", "name").exceptAll(df2).count()
> // Throws: INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND

> exceptAll not working with drop_duplicates using subset
> -------------------------------------------------------
>
>                 Key: SPARK-51262
>                 URL: https://issues.apache.org/jira/browse/SPARK-51262
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.0, 3.5.3
>            Reporter: Nicolau Balbino
>            Priority: Minor
>              Labels: SQL, pull-request-available
>
> When calling drop_duplicates with a subset and then exceptAll, any subsequent 
> action (isEmpty, show, collect, count) raises a Py4J error. 
> Searching the web, this issue appears to be related to 
> [https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-39612], 
> which is marked as resolved.
> I tested locally with version 3.5.3 and also on AWS Glue 5.0, which uses 3.5.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

