Thanks a lot, Yong, for the explanation. But it sounds like an API
behaviour change. For now I do a count != 0 check on both dataframes
before these operations. Not good from a performance point of view,
hence I have created a JIRA (SPARK-20008) to track it.
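
For reference, the guard looks roughly like this (a minimal sketch;
dataFramesEqual is just a hypothetical helper name):

import org.apache.spark.sql.DataFrame

def dataFramesEqual(a: DataFrame, b: DataFrame): Boolean = {
  val (countA, countB) = (a.count(), b.count())
  if (countA == 0 && countB == 0) true   // both empty: equal, skip except()
  else countA == countB &&
    a.except(b).count() == 0 &&          // except() is a set difference,
    b.except(a).count() == 0             // so duplicate rows are not distinguished
}

The extra count() actions are what hurt performance, since each one
triggers a separate Spark job.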

Thanks,
Ravindra.

On Fri, Mar 17, 2017 at 8:51 PM Yong Zhang <java8...@hotmail.com> wrote:

> Starting from Spark 2, this kind of operation is implemented as a left
> anti join, instead of using RDD operations directly.
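>
> For illustration, except is planned much like this hand-written left
> anti join with a distinct on top (a minimal sketch; df1/df2 are made-up
> names, and <=> is Spark's null-safe equality operator):
>
> scala> import spark.implicits._
> scala> val df1 = Seq(1, 2, 3).toDF("id")
> scala> val df2 = Seq(2, 3).toDF("id")
> scala> df1.join(df2, df1("id") <=> df2("id"), "leftanti").distinct.show()
> // returns the single row with id = 1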
>
>
> The same issue also occurs on sqlContext.
>
>
> scala> spark.version
> res25: String = 2.0.2
>
>
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
>
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
>
> This arguably indicates a bug. But my guess is that the logic of
> comparing NULL = NULL, and whether it should return true or false, is
> causing this kind of confusion.
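>
> For example, the two equality semantics are easy to see directly in SQL
> (a minimal sketch):
>
> scala> spark.sql("SELECT NULL = NULL").show()   // yields NULL, not true
> scala> spark.sql("SELECT NULL <=> NULL").show() // null-safe equals: true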
>
> Yong
>
> ------------------------------
> *From:* Ravindra <ravindra.baj...@gmail.com>
> *Sent:* Friday, March 17, 2017 4:30 AM
> *To:* user@spark.apache.org
> *Subject:* Spark 2.0.2 -
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
>
> Can someone please explain why
>
> println ( " Empty count " +
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() )
>
> *prints* -  Empty count 1
>
> This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and
> found this. It causes my tests to fail. Is there another way to check
> full equality of two DataFrames?
>
> Thanks,
> Ravindra.
>
