GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/7379
[SPARK-8682][SQL][WIP] Range Join
*...copied from JIRA (SPARK-8682):*
Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered
Cartesian Join) when it has to execute the following range query:
```
SELECT A.*,
       B.*
FROM   tableA A
       JOIN tableB B
         ON A.start <= B.end
        AND A.end > B.start
```
This is horribly inefficient. When one of the tables is small enough to be
broadcast, the performance of this query can be greatly improved by creating a
range index. A range index is essentially a sorted map containing the rows of
the smaller table, indexed by both the high and low keys. Using this structure,
the complexity of the query drops from O(N * M) to O(N * 2 * LOG(M)), where N
is the number of records in the larger table and M is the number of records in
the smaller (indexed) table.
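To illustrate where the two binary searches per probe row come from, here is a minimal Python sketch of the idea. It is not the code in this pull request: the `RangeIndex` class and its method names are hypothetical, rows are plain `(start, end)` tuples, and it assumes `start <= end` for every row in both tables.

```python
from bisect import bisect_left


class RangeIndex:
    """Hypothetical index over the smaller table's (start, end) rows."""

    def __init__(self, rows):
        # Rows sorted by low key; assumes start <= end in every row.
        self.rows = sorted(rows, key=lambda r: r[0])
        self.starts = [r[0] for r in self.rows]
        # High keys are sorted independently of the low keys.
        self.ends = sorted(r[1] for r in self.rows)

    def count_matches(self, a_start, a_end):
        """Count rows B with A.start <= B.end AND A.end > B.start."""
        # Rows with B.start < A.end form a prefix of the sorted low keys.
        started_before = bisect_left(self.starts, a_end)
        # Rows with B.end < A.start cannot match; they form a prefix
        # of the sorted high keys and are all counted above as well.
        ended_before = bisect_left(self.ends, a_start)
        return started_before - ended_before

    def matches(self, a_start, a_end):
        """Return the matching rows for one probe row."""
        # One binary search bounds the candidates by low key;
        # the high-key check then filters that prefix linearly.
        hi = bisect_left(self.starts, a_end)
        return [r for r in self.rows[:hi] if r[1] >= a_start]


index = RangeIndex([(4, 8), (1, 5), (10, 12)])
print(index.matches(5, 10))        # rows overlapping the probe range
print(index.count_matches(5, 10))  # same answer via two binary searches
```

Counting needs exactly two binary searches per probe row, which is the `2 * LOG(M)` term. Returning the rows in this simplified version still scans the candidate prefix, so it only bounds the search; it does not reach the logarithmic cost in the worst case.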
This is currently a work in progress. I will be adding more tests and a
small benchmark in the next couple of days. If you want to try this out, set
the `spark.sql.planner.rangeJoin` option to `true` in the SQL
configuration.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-8682
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7379.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7379
----
commit a2ff5dd2c54ca00784fc529bea2da2f05897786b
Author: Herman van Hovell <[email protected]>
Date: 2015-07-01T02:17:03Z
Initial Range Join commit: Compiles & Style Checks work.
commit d2bd7932a2f15a41e39aca1d9fad3441b85fae44
Author: Herman van Hovell <[email protected]>
Date: 2015-07-13T22:27:03Z
Added Tests for Range Index. Ton of Bug Fixes.
----