GitHub user rxin opened a pull request:
https://github.com/apache/spark/pull/16925
[SPARK-16475][SQL] Broadcast Hint for SQL Queries
## What changes were proposed in this pull request?
This PR aims to achieve the following two goals in Spark SQL.
1. Generic Hint Syntax
Generic hints are parsed and then transformed into concrete hints by the
SubstituteHints rule of the Analyzer. Unknown hints are removed as well. For
example, Hint("MAPJOIN") is transformed into BroadcastJoin, and all other hints
are currently removed.
```
SELECT /*+ MAPJOIN(t) */ * FROM t
SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t
SELECT /*+ INDEX(t emp_job_ix) */ * FROM t
```
Unlike Hive, NEWMAPJOIN(t) is allowed, so that new Spark hints can be accepted.
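To make the generic syntax concrete, here is a minimal, self-contained sketch of how a hint comment could be split into a name and a parameter list. It is a toy regex-based model, not the grammar this PR adds to the Spark SQL parser; `ParsedHint` and `HintSyntaxSketch` are hypothetical names introduced only for illustration.

```scala
// Hypothetical container for a parsed hint; not a Spark class.
final case class ParsedHint(name: String, parameters: Seq[String])

object HintSyntaxSketch {
  // Matches a hint comment of the form: slash-star-plus NAME(args) star-slash.
  // Any identifier is accepted as the name, so unknown names still parse.
  private val HintPattern = """/\*\+\s*(\w+)\s*\(([^)]*)\)\s*\*/""".r

  def parseHint(sql: String): Option[ParsedHint] =
    HintPattern.findFirstMatchIn(sql).map { m =>
      // Parameters may be separated by commas and/or whitespace,
      // as in STREAMTABLE(a,b,c) and INDEX(t emp_job_ix).
      val params = m.group(2).split("[,\\s]+").filter(_.nonEmpty).toSeq
      ParsedHint(m.group(1), params)
    }
}
```

With this sketch, `HintSyntaxSketch.parseHint("SELECT /*+ NEWMAPJOIN(t) */ * FROM t")` yields a `ParsedHint` named `NEWMAPJOIN` with the single parameter `t`; deciding what to do with an unrecognized name is left to a later step.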
2. Broadcast Hints
The following hints are recognized. Technically, broadcast hints are matched
against UnresolvedRelation in order to support Hive MetastoreRelation. The
database_name.table_name style is not allowed in this PR.
```
SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCAST(u) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCASTJOIN(u) */ * FROM t JOIN u ON t.id = u.id
```
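The substitution described above can be illustrated with a small toy model. The sketch below does not use Spark's classes: `Plan`, `Relation`, `Join`, `Hint`, `Broadcast`, and `SubstituteHintsSketch` are hypothetical stand-ins, and matching hint names exactly as written in upper case is an assumption of the sketch. It shows only the idea that the three recognized names mark a relation by its plain table name, while any other hint is dropped.

```scala
// Toy logical plan; hypothetical stand-ins, not Spark's plan classes.
sealed trait Plan
final case class Relation(name: String) extends Plan
final case class Join(left: Plan, right: Plan) extends Plan
final case class Hint(name: String, tables: Seq[String], child: Plan) extends Plan
final case class Broadcast(child: Plan) extends Plan

object SubstituteHintsSketch {
  // The three hint names listed in this PR description.
  private val broadcastNames = Set("MAPJOIN", "BROADCAST", "BROADCASTJOIN")

  // Wrap every relation whose plain table name appears in the hint.
  private def mark(plan: Plan, tables: Set[String]): Plan = plan match {
    case r @ Relation(n) if tables.contains(n) => Broadcast(r)
    case Join(l, r)                            => Join(mark(l, tables), mark(r, tables))
    case other                                 => other
  }

  def substitute(plan: Plan): Plan = plan match {
    case Hint(name, tables, child) if broadcastNames.contains(name) =>
      mark(substitute(child), tables.toSet)
    case Hint(_, _, child) => substitute(child) // unknown hint: removed
    case Join(l, r)        => Join(substitute(l), substitute(r))
    case other             => other
  }
}
```

For example, `SubstituteHintsSketch.substitute(Hint("BROADCAST", Seq("u"), Join(Relation("t"), Relation("u"))))` returns `Join(Relation("t"), Broadcast(Relation("u")))`, whereas a plan wrapped in `Hint("INDEX", ...)` comes back with the hint node simply stripped.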
Examples
```
scala> spark.range(1000000000).createOrReplaceTempView("t")
scala> spark.range(1000000000).createOrReplaceTempView("u")
scala> sql("SELECT * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#4L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *Range (0, 1000000000, splits=8)
+- *Sort [id#4L ASC], false, 0
   +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
scala> sql("SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:  +- *Range (0, 1000000000, splits=8)
+- *Range (0, 1000000000, splits=8)
scala> sql("SELECT /*+ MAPJOIN(u) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:- *Range (0, 1000000000, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 1000000000, splits=8)
scala> sql("CREATE TABLE hive_t(id INT)")
res5: org.apache.spark.sql.DataFrame = []
scala> sql("CREATE TABLE hive_u(id INT)")
res6: org.apache.spark.sql.DataFrame = []
scala> sql("SELECT /*+ MAPJOIN(hive_u) */ * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#28], [id#29], Inner, BuildRight
:- *Filter isnotnull(id#28)
:  +- HiveTableScan [id#28], MetastoreRelation default, hive_t
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *Filter isnotnull(id#29)
      +- HiveTableScan [id#29], MetastoreRelation default, hive_u
scala> sql("SELECT * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*SortMergeJoin [id#36], [id#37], Inner
:- *Sort [id#36 ASC], false, 0
:  +- Exchange hashpartitioning(id#36, 200)
:     +- *Filter isnotnull(id#36)
:        +- HiveTableScan [id#36], MetastoreRelation default, hive_t
+- *Sort [id#37 ASC], false, 0
   +- Exchange hashpartitioning(id#37, 200)
      +- *Filter isnotnull(id#37)
         +- HiveTableScan [id#37], MetastoreRelation default, hive_u
```
This patch is based on the work done in
https://github.com/apache/spark/pull/14426.
## How was this patch tested?
Added a new unit test suite for the broadcast hint rule
(SubstituteHintsSuite) and new test cases for the parser change (in
PlanParserSuite). Also added end-to-end test suites.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark SPARK-16475-broadcast-hint
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16925.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16925
----
commit 539782d93af14ed27c6f3a0fc659b13c4f92da41
Author: Dongjoon Hyun
Date: 2016-07-11T08:04:51Z
[SPARK-16475][SQL] Broadcast Hint for SQL Queries
commit 318bc031985d178cb0ae7612cbee4d62636916b3
Author: Reynold Xin
Date: 2017-02-14T12:16:55Z
Merge pull request #14426 from dongjoon-hyun/SPARK-16475-HINT
[SPARK-16475][SQL] Broadcast Hint for SQL Queries
----