GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/7379

    [SPARK-8682][SQL][WIP] Range Join

    *...copied from JIRA (SPARK-8682):*
    
    Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered 
Cartesian Join) when it has to execute the following range query:
    ```
    SELECT A.*,
           B.*
    FROM   tableA A
           JOIN tableB B
            ON A.start <= B.end
             AND A.end > B.start
    ```
    This is horribly inefficient. When one of the tables is small enough to be 
broadcast, the performance of this query can be greatly improved by creating a 
range index. A range index is essentially a sorted map containing the rows of 
the smaller table, indexed by both the high and low keys. Using this structure, 
the complexity of the query would go from O(N * M) to O(N * 2 * log(M)), where 
N is the number of records in the larger table and M is the number of records 
in the smaller (indexed) table.
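
A minimal sketch of the range index idea, written in plain Python rather than Spark code. It is not the implementation in this pull request: the names `RangeIndex`, `lookup` and `range_join` are hypothetical, and rows are assumed to be `(start, end, payload)` tuples. The smaller table is sorted by its low key, and a running maximum of the high key is kept so that both bounds can be located with one binary search each (the `2 * log(M)` term); the remaining candidates are then filtered.

```python
from bisect import bisect_left


class RangeIndex:
    """Hypothetical sorted index over (start, end, payload) rows."""

    def __init__(self, rows):
        # Sort the smaller table by its low key (start).
        self._rows = sorted(rows, key=lambda r: r[0])
        self._starts = [r[0] for r in self._rows]
        # Running maximum of the high key (end). It is non-decreasing,
        # so it can be binary-searched as well.
        self._max_ends = []
        running = None
        for _, end, _ in self._rows:
            if running is None or end > running:
                running = end
            self._max_ends.append(running)

    def lookup(self, a_start, a_end):
        """Return rows B with a_start <= B.end and a_end > B.start."""
        # Search 1: rows with B.start < a_end form the prefix [0, hi).
        hi = bisect_left(self._starts, a_end)
        # Search 2: no row before lo has B.end >= a_start.
        lo = bisect_left(self._max_ends, a_start)
        # Filter the remaining candidates on the high key.
        return [r for r in self._rows[lo:hi] if r[1] >= a_start]


def range_join(table_a, table_b):
    """Join every row of table_a against an index built on table_b."""
    index = RangeIndex(table_b)
    return [(a, b) for a in table_a for b in index.lookup(a[0], a[1])]


if __name__ == "__main__":
    table_a = [(1, 5, "a1"), (6, 8, "a2"), (10, 12, "a3")]
    table_b = [(0, 1, "b1"), (2, 3, "b2"), (5, 7, "b3"), (8, 20, "b4")]
    for a, b in range_join(table_a, table_b):
        print(a[2], b[2])
```

The two binary searches only narrow the candidate window; the final filter still scans that window, so the logarithmic bound holds when few indexed rows overlap each probe row.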
    
    This is currently a work in progress. I will be adding more tests and a 
small benchmark in the next couple of days. If you want to try this out, set 
the `spark.sql.planner.rangeJoin` option to `true` in the SQL configuration.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-8682

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7379
    
----
commit a2ff5dd2c54ca00784fc529bea2da2f05897786b
Author: Herman van Hovell <[email protected]>
Date:   2015-07-01T02:17:03Z

    Initial Range Join commit: Compiles & Style Checks work.

commit d2bd7932a2f15a41e39aca1d9fad3441b85fae44
Author: Herman van Hovell <[email protected]>
Date:   2015-07-13T22:27:03Z

    Added Tests for Range Index. Ton of Bug Fixes.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
