GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/11188
[SPARK-13314] [SQL] Fix worst case of broadcast join of two ints
If the two join columns have the same value, the hash code of them will be
(a ^ b), which is 0, then the HashMap will be very very slow.
This PR will rotate the second int to avoid this case. In theory, it's
still have the possibility that has lots of collisions, the pattern will be (1,
131072), (2, 131073) ... (n, n + 131072).
This PR also added some micro benchmark, and updated the results for
broadcast hash joins.
This PR is based on #11130
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark fix_ints
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11188.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11188
----
commit 52efe91168a4be7ce721d2f56e2b1e7aab9379db
Author: Davies Liu <[email protected]>
Date: 2016-02-09T00:33:44Z
generated broadcast outer join
commit 9525782c971f52c1343830402276086cd0e4ae8f
Author: Davies Liu <[email protected]>
Date: 2016-02-09T18:18:45Z
refactor
commit 9a1f5325e954d8464d28ebf415c9dca665e15d35
Author: Davies Liu <[email protected]>
Date: 2016-02-09T18:21:31Z
fix style
commit 98cda0be6cdc3687d045aa5b881676758d66842c
Author: Davies Liu <[email protected]>
Date: 2016-02-09T18:32:35Z
Merge branch 'master' into gen_out
commit edbc284921281358a38b300218ff288c33cdc3b4
Author: Davies Liu <[email protected]>
Date: 2016-02-09T19:48:58Z
fix tests
commit da45df1536f112a14bfe15d6d30d307cdbd99d5b
Author: Davies Liu <[email protected]>
Date: 2016-02-10T19:20:33Z
address comments
commit 9b05c7cd335f06079c241f681282bd36306dc739
Author: Davies Liu <[email protected]>
Date: 2016-02-12T18:26:24Z
Merge branch 'master' of github.com:apache/spark into gen_out
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
commit 1c0ee96e80d5cc1909d7d5ec794b74e76979ae45
Author: Davies Liu <[email protected]>
Date: 2016-02-12T20:40:30Z
fix worst case of broadcast join with two ints
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]