[GitHub] [spark] c21 commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

GitBox Mon, 20 Sep 2021 22:01:38 -0700


c21 commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r712695659




##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,28 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {

Review comment:
   > but this test passes with and without the patch, right?
   
   Yes. This unit test is added mostly to verify that the join still works as
   expected after this PR. This PR does not fix a regression; it is just an
   improvement.
   
   > Seems there isn't a way to show the difference?
   
   Well, in theory we can add more code to check the number of rows inside
   `HashedRelation`, which should differ before and after this PR. However,
   this would need more code changes, e.g. introducing a new SQL metric for
   the number of rows inside `HashedRelation`, which does not look necessary.
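
To illustrate the point about row counts, here is a minimal toy model in plain Python. It is not Spark code and does not use `HashedRelation`; the names `build_relation`, `ignore_duplicated_key`, and `num_rows` are hypothetical, and null join keys are not modeled. It only shows why the join output is the same with and without the change while the number of rows kept on the build side differs.

```python
from collections import defaultdict


def build_relation(rows, key_of, ignore_duplicated_key):
    """Toy stand-in for a hashed relation: join key -> list of build-side rows.

    `ignore_duplicated_key` is a hypothetical flag for this sketch only.
    """
    relation = defaultdict(list)
    for row in rows:
        key = key_of(row)
        if ignore_duplicated_key and key in relation:
            # A semi/anti join only asks whether the key exists,
            # so further rows with the same key add nothing.
            continue
        relation[key].append(row)
    return dict(relation)


def num_rows(relation):
    """Total number of build-side rows stored in the toy relation."""
    return sum(len(rows) for rows in relation.values())


def left_semi(stream_rows, relation, key_of):
    """Keep stream-side rows whose key exists on the build side."""
    return [row for row in stream_rows if key_of(row) in relation]


def left_anti(stream_rows, relation, key_of):
    """Keep stream-side rows whose key does not exist on the build side."""
    return [row for row in stream_rows if key_of(row) not in relation]


def first_column(row):
    return row[0]


build_side = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
stream_side = [(1, "x"), (2, "y"), (3, "z")]

before = build_relation(build_side, first_column, ignore_duplicated_key=False)
after = build_relation(build_side, first_column, ignore_duplicated_key=True)

# Same join output either way, which is why a result-only test cannot
# tell the two builds apart.
semi_before = left_semi(stream_side, before, first_column)
semi_after = left_semi(stream_side, after, first_column)
anti_before = left_anti(stream_side, before, first_column)
anti_after = left_anti(stream_side, after, first_column)

# Only the stored row count differs between the two builds.
rows_before = num_rows(before)
rows_after = num_rows(after)
```

In this model only `rows_before` and `rows_after` differ, so a test that compares join results alone passes in both cases; exposing the difference would need something like the extra metric mentioned above.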




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


