GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/11188

    [SPARK-13314] [SQL] Fix worst case of broadcast join of two ints

    If the two join columns have the same value, their combined hash code is 
(a ^ b), which is 0. Every such key then lands on the same hash code and the 
HashMap becomes very slow.
    
    This PR rotates the second int to avoid this case. In theory, inputs with 
lots of collisions are still possible; the pattern would be (1, 131072), 
(2, 131073) ... (n, n + 131072).
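    
    The collision described above can be illustrated with a minimal Java 
sketch. The class name, the method names and the rotation distance of 15 bits 
are assumptions made for illustration; they are not taken from the patch.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: class and method names are hypothetical, not Spark's.
class TwoIntHashSketch {
    // Assumed rotation distance; the PR description does not state the real one.
    static final int ROTATE_BITS = 15;

    // Plain XOR of the two join columns: always 0 when a == b.
    static int xorHash(int a, int b) {
        return a ^ b;
    }

    // Rotate the second int before the XOR so equal columns no longer cancel.
    static int rotatedHash(int a, int b) {
        return a ^ Integer.rotateLeft(b, ROTATE_BITS);
    }

    // Count distinct hash codes over the keys (1, 1), (2, 2), ..., (n, n).
    static int distinctHashes(int n, boolean rotate) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            seen.add(rotate ? rotatedHash(i, i) : xorHash(i, i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("xor:     " + distinctHashes(1000, false) + " distinct hash codes");
        System.out.println("rotated: " + distinctHashes(1000, true) + " distinct hash codes");
    }
}
```

    With this assumed rotation, (7, 7) no longer hashes to 0, while 
(1, 131072) still does, so a rotation moves the worst case rather than 
removing it.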
    
    This PR also adds some micro benchmarks and updates the results for 
broadcast hash joins.
    
    This PR is based on #11130 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_ints

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11188
    
----
commit 52efe91168a4be7ce721d2f56e2b1e7aab9379db
Author: Davies Liu <[email protected]>
Date:   2016-02-09T00:33:44Z

    generated broadcast outer join

commit 9525782c971f52c1343830402276086cd0e4ae8f
Author: Davies Liu <[email protected]>
Date:   2016-02-09T18:18:45Z

    refactor

commit 9a1f5325e954d8464d28ebf415c9dca665e15d35
Author: Davies Liu <[email protected]>
Date:   2016-02-09T18:21:31Z

    fix style

commit 98cda0be6cdc3687d045aa5b881676758d66842c
Author: Davies Liu <[email protected]>
Date:   2016-02-09T18:32:35Z

    Merge branch 'master' into gen_out

commit edbc284921281358a38b300218ff288c33cdc3b4
Author: Davies Liu <[email protected]>
Date:   2016-02-09T19:48:58Z

    fix tests

commit da45df1536f112a14bfe15d6d30d307cdbd99d5b
Author: Davies Liu <[email protected]>
Date:   2016-02-10T19:20:33Z

    address comments

commit 9b05c7cd335f06079c241f681282bd36306dc739
Author: Davies Liu <[email protected]>
Date:   2016-02-12T18:26:24Z

    Merge branch 'master' of github.com:apache/spark into gen_out
    
    Conflicts:
        sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala
        sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
        sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala
        sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

commit 1c0ee96e80d5cc1909d7d5ec794b74e76979ae45
Author: Davies Liu <[email protected]>
Date:   2016-02-12T20:40:30Z

    fix worst case of broadcast join with two ints

----

