Thanks a lot, Yong, for the explanation. But it sounds like an API behaviour change. For now I do a count != 0 check on both dataframes before these operations. Not good from a performance point of view, hence I have created a JIRA (SPARK-20008) to track it.
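A minimal sketch of the workaround described above, assuming two placeholder DataFrames `df1` and `df2` (hypothetical names, not from the thread) inside an active spark-shell session:

```scala
import org.apache.spark.sql.DataFrame

// Workaround sketch: guard except() with count checks so that two empty
// DataFrames are treated as having no difference, instead of trusting
// except().count(), which returns 1 on Spark 2.0.2 for empty inputs.
def diffCount(df1: DataFrame, df2: DataFrame): Long = {
  if (df1.count() == 0 && df2.count() == 0) 0L // both empty: no difference
  else df1.except(df2).count()
}
```

The extra `count()` calls trigger two additional jobs per comparison, which is the performance cost mentioned above.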
Thanks,
Ravindra.

On Fri, Mar 17, 2017 at 8:51 PM Yong Zhang <java8...@hotmail.com> wrote:

> Starting from Spark 2, these kinds of operations are implemented as a left
> anti join, instead of using RDD operations directly.
>
> The same issue also occurs on sqlContext:
>
> scala> spark.version
> res25: String = 2.0.2
>
> scala> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
>
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
>
> This arguably means a bug. But my guess is that it relates to the logic of
> comparing NULL = NULL (should it return true or false?), causing this kind
> of confusion.
>
> Yong
>
> ------------------------------
> From: Ravindra <ravindra.baj...@gmail.com>
> Sent: Friday, March 17, 2017 4:30 AM
> To: user@spark.apache.org
> Subject: Spark 2.0.2 -
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
>
> Can someone please explain why
>
> println("Empty count " +
>   hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count())
>
> prints - Empty count 1
>
> This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and
> found this; it causes my tests to fail. Is there another way to check
> full equality of 2 dataframes?
>
> Thanks,
> Ravindra.
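On the last question in the quoted mail ("Is there another way to check full equality of 2 dataframes"), one sketch that avoids the empty-DataFrame pitfall is a two-sided except plus a count comparison. This is only an illustration, assuming hypothetical DataFrames `a` and `b` with identical schemas; note that `except` has set semantics, so the count comparison is what catches differing duplicate multiplicities:

```scala
import org.apache.spark.sql.DataFrame

// Full-equality sketch: equal row counts and no rows unique to either side.
// The empty/empty guard works around except().count() returning 1 on 2.0.2.
def dataFramesEqual(a: DataFrame, b: DataFrame): Boolean = {
  val (na, nb) = (a.count(), b.count())
  if (na == 0 && nb == 0) true // both empty: considered equal
  else na == nb &&
    a.except(b).count() == 0 && // no rows only in a
    b.except(a).count() == 0    // no rows only in b
}
```

This is O(several full scans), so it is suited to test code rather than production paths.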