[
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935540#comment-15935540
]
Hyukjin Kwon commented on SPARK-20008:
--------------------------------------
[~smilegator], it seems the discussion is about duplicates in the result, if I
understood correctly.
The problem here is that {{Set() - Set()}} should return an empty {{Set()}},
which is what happened previously.
However, it now seems to return {{Set(Row())}} for empty DataFrames.
In the current master,
{code}
scala> spark.emptyDataFrame.except(spark.emptyDataFrame).collect()
res0: Array[org.apache.spark.sql.Row] = Array([])
scala> spark.emptyDataFrame.collect()
res1: Array[org.apache.spark.sql.Row] = Array()
{code}
I thought S∖S=∅ as below:
{code}
scala> spark.range(1).except(spark.range(1)).collect()
res14: Array[Long] = Array()
{code}
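The expectation above can be written down without Spark at all. The following is only a minimal sketch in plain Scala collections: {{ExceptModel}}, its {{Row}} alias and {{exceptRows}} are hypothetical names used for illustration, not Spark APIs, and a row is modelled as a sequence of column values.

```scala
object ExceptModel {
  // Assumption: a row is modelled as its column values, so a row of a
  // zero-column schema is the empty sequence.
  type Row = Seq[Any]

  // Set difference on rows: every row of `left` that does not occur in `right`.
  def exceptRows(left: Set[Row], right: Set[Row]): Set[Row] = left.diff(right)

  // Models spark.emptyDataFrame: no columns and no rows.
  val empty: Set[Row] = Set.empty

  // Models spark.range(1): a single row holding the value 0.
  val one: Set[Row] = Set(Seq(0L))
}
```

Under this model, {{ExceptModel.exceptRows(ExceptModel.empty, ExceptModel.empty)}} has no elements, which is the behaviour I would expect from {{except}} on two empty DataFrames as well.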
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns
> 1
> -------------------------------------------------------------------------------
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.2.0
> Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields
> 1 against expected 0.
> This was not the case with spark 1.5.2. This is an api change from usage
> point of view and hence I consider this as a bug. May be a boundary case, not
> sure.
> Work around - For now I check the counts != 0 before this operation. Not good
> for performance. Hence creating a jira to track it.
> As Young Zhang explained in reply to my mail -
> Starting from Spark 2, these kinds of operations are implemented as a left
> anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is like the logic of
> comparing NULL = NULL (should it return true or false?), which causes this
> kind of confusion.