[GitHub] spark pull request #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

dongjoon-hyun Sat, 30 Jul 2016 16:58:33 -0700

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/14426


    [SPARK-16475][SQL] Broadcast Hint for SQL Queries

    ## What changes were proposed in this pull request?
    This PR aims to achieve the following two goals in Spark SQL.
    
    **1. Generic Hint Syntax**
    The generic hints are parsed and transformed into concrete hints by
    `SubstituteHints` of **Analyzer**. Unknown hints are removed, too. For
    example, `Hint("MAPJOIN")` is transformed into `BroadcastJoin`, and all
    other hints are currently removed.
    
    ```sql
    SELECT /*+ MAPJOIN(t) */ * FROM t
    SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t
    SELECT /*+ INDEX(t emp_job_ix) */ * FROM t
    ```
    Unlike Hive, a new hint name such as `NEWMAPJOIN(t)` is allowed by the syntax, so that new Spark hints can be accepted.
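
    The substitution step described above can be sketched with a toy plan model. This is a minimal illustration, not the code in this PR: the `GenericHintSketch` object, its case classes, and the `substitute` function are hypothetical names, and the real rule works on Catalyst logical plans. Matching on exact upper-case hint names is also an assumption of the sketch.

    ```scala
    // Toy model only: none of these types are Spark's Catalyst classes.
    object GenericHintSketch {
      sealed trait Plan
      case class Relation(name: String) extends Plan
      // A generic hint as produced by the parser: a name plus raw parameters.
      case class Hint(name: String, parameters: Seq[String], child: Plan) extends Plan
      // A concrete hint node (hypothetical stand-in for the PR's broadcast node).
      case class BroadcastNode(tables: Seq[String], child: Plan) extends Plan

      // Hint names listed in this PR description as broadcast hints.
      val broadcastHintNames: Set[String] = Set("MAPJOIN", "BROADCAST", "BROADCASTJOIN")

      // Rewrite known generic hints into concrete nodes and drop unknown ones.
      def substitute(plan: Plan): Plan = plan match {
        case Hint(name, params, child) if broadcastHintNames.contains(name) =>
          BroadcastNode(params, substitute(child))
        case Hint(_, _, child) =>
          substitute(child) // unknown hint: remove it, keep the child plan
        case BroadcastNode(tables, child) =>
          BroadcastNode(tables, substitute(child))
        case r: Relation => r
      }
    }
    ```

    Under these assumptions, `substitute(Hint("MAPJOIN", Seq("t"), Relation("t")))` returns `BroadcastNode(Seq("t"), Relation("t"))`, while `substitute(Hint("STREAMTABLE", Seq("a", "b", "c"), Relation("t")))` returns the bare `Relation("t")`.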
    
    
    **2. Broadcast Hints**
    The following hints are recognized. Technically, broadcast hints are
    matched against `UnresolvedRelation` in order to support Hive
    `MetastoreRelation`. The `database_name.table_name` style is not allowed in
    this PR.
    ```sql
    SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id
    SELECT /*+ BROADCAST(u) */ * FROM t JOIN u ON t.id = u.id
    SELECT /*+ BROADCASTJOIN(u) */ * FROM t JOIN u ON t.id = u.id
    ```
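
    How a hinted table name selects one side of a join can be sketched as follows. This is a toy model under stated assumptions, not the PR's implementation: `BroadcastHintSketch`, `mark`, and `buildSide` are hypothetical names, relations are matched by their bare table name only, and preferring the left side when both sides are hinted is an arbitrary choice made for the sketch.

    ```scala
    // Toy model only: none of these types are Spark's Catalyst classes.
    object BroadcastHintSketch {
      sealed trait Plan
      case class Relation(table: String) extends Plan
      case class Join(left: Plan, right: Plan) extends Plan
      // Marks a subtree as one that should be broadcast.
      case class Broadcast(child: Plan) extends Plan

      // Wrap every relation whose bare table name appears in the hint.
      def mark(plan: Plan, hinted: Set[String]): Plan = plan match {
        case r @ Relation(t) if hinted.contains(t) => Broadcast(r)
        case r: Relation => r
        case Join(l, r) => Join(mark(l, hinted), mark(r, hinted))
        case b: Broadcast => b
      }

      // Report which join side is marked; None means no side was hinted.
      def buildSide(plan: Plan): Option[String] = plan match {
        case Join(Broadcast(_), _) => Some("BuildLeft")
        case Join(_, Broadcast(_)) => Some("BuildRight")
        case _ => None
      }
    }
    ```

    With `Join(Relation("t"), Relation("u"))` as input, hinting `t` marks the left side and hinting `u` marks the right side, which mirrors the `BuildLeft` and `BuildRight` plans shown in the examples in this description.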
    
    **Examples**
    ```scala
    scala> spark.range(1000000000).createOrReplaceTempView("t")
    scala> spark.range(1000000000).createOrReplaceTempView("u")
    
    scala> sql("SELECT * FROM t JOIN u ON t.id = u.id").explain
    == Physical Plan ==
    *SortMergeJoin [id#0L], [id#4L], Inner
    :- *Sort [id#0L ASC], false, 0
    :  +- Exchange hashpartitioning(id#0L, 200)
    :     +- *Range (0, 1000000000, splits=8)
    +- *Sort [id#4L ASC], false, 0
       +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
    
    scala> sql("SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id").explain
    == Physical Plan ==
    *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
    :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
    :  +- *Range (0, 1000000000, splits=8)
    +- *Range (0, 1000000000, splits=8)
    
    scala> sql("SELECT /*+ MAPJOIN(u) */ * FROM t JOIN u ON t.id = u.id").explain
    == Physical Plan ==
    *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
    :- *Range (0, 1000000000, splits=8)
    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
       +- *Range (0, 1000000000, splits=8)
    
    scala> sql("CREATE TABLE hive_t(id INT)")
    res5: org.apache.spark.sql.DataFrame = []
    
    scala> sql("CREATE TABLE hive_u(id INT)")
    res6: org.apache.spark.sql.DataFrame = []
    
    scala> sql("SELECT /*+ MAPJOIN(hive_u) */ * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
    == Physical Plan ==
    *BroadcastHashJoin [id#28], [id#29], Inner, BuildRight
    :- *Filter isnotnull(id#28)
    :  +- HiveTableScan [id#28], MetastoreRelation default, hive_t
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
       +- *Filter isnotnull(id#29)
          +- HiveTableScan [id#29], MetastoreRelation default, hive_u
    
    scala> sql("SELECT * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
    == Physical Plan ==
    *SortMergeJoin [id#36], [id#37], Inner
    :- *Sort [id#36 ASC], false, 0
    :  +- Exchange hashpartitioning(id#36, 200)
    :     +- *Filter isnotnull(id#36)
    :        +- HiveTableScan [id#36], MetastoreRelation default, hive_t
    +- *Sort [id#37 ASC], false, 0
       +- Exchange hashpartitioning(id#37, 200)
          +- *Filter isnotnull(id#37)
             +- HiveTableScan [id#37], MetastoreRelation default, hive_u
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins tests with new test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-16475-HINT

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14426
    
----
commit ee8bb1471b610d634893c145eeaada3045bcaccd
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2016-07-11T08:04:51Z

    [SPARK-16475][SQL] Broadcast Hint for SQL Queries

----


