GitHub user wangxiaojing opened a pull request:

    https://github.com/apache/spark/pull/3442

    [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

    JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
    We are planning to add a `BroadcastLeftSemiJoinHash` operator that implements 
broadcast join for `left semi join`.
    In a left semi join:
    If the size of the data on the right side is smaller than the user-settable 
threshold `AUTO_BROADCASTJOIN_THRESHOLD`, 
    the planner marks it as the `broadcast` relation and marks the other 
relation as the stream side. The broadcast table is sent to all of 
the executors involved in the join as an `org.apache.spark.broadcast.Broadcast` 
object, and the planner uses `joins.BroadcastLeftSemiJoinHash`. Otherwise it uses 
`joins.LeftSemiJoinHash`.
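
The selection rule and the join semantics described above can be sketched with plain Scala collections. This is only an illustration, not the code in this pull request: `SemiJoinSketch`, `Row`, `choosePlan` and `leftSemiJoin` are hypothetical names, the two case objects merely borrow the operator names, and treating the threshold as an inclusive bound is an assumption.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark's classes.
object SemiJoinSketch {
  final case class Row(key: Int, value: String)

  sealed trait SemiJoinPlan
  case object BroadcastLeftSemiJoinHash extends SemiJoinPlan
  case object LeftSemiJoinHash extends SemiJoinPlan

  // Pick the broadcast variant only when the right side fits under the threshold.
  // Assumption: the bound is inclusive.
  def choosePlan(rightSizeInBytes: Long, autoBroadcastJoinThreshold: Long): SemiJoinPlan =
    if (rightSizeInBytes <= autoBroadcastJoinThreshold) BroadcastLeftSemiJoinHash
    else LeftSemiJoinHash

  // Left semi join: emit each streamed (left) row at most once, and only if its
  // key occurs on the broadcast (right) side. The key set stands in for the
  // hash structure that would be shipped to every executor.
  def leftSemiJoin(streamed: Seq[Row], broadcast: Seq[Row]): Seq[Row] = {
    val broadcastKeys: Set[Int] = broadcast.iterator.map(_.key).toSet
    streamed.filter(row => broadcastKeys.contains(row.key))
  }
}
```

Because only key membership matters, duplicate keys on the right side never duplicate left rows, which is why a set of keys is enough on the broadcast side.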
    
    The benchmark suggests this change makes `left semi join` about 4.7x 
faster (9288 ms vs. 1963 ms):
    <pre><code>
    Original:
    left semi join : 9288 ms 
    Optimized:
    left semi join : 1963 ms 
    </code></pre>
    The micro benchmark loads `/data1/kv3.txt` into a normal Hive table.
    Benchmark code:
    <pre><code>
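      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.hive.HiveContext

      // Returns the wall-clock time, in milliseconds, taken to evaluate `f`.
      def benchmark(f: => Unit): Long = {
        val begin = System.currentTimeMillis()
        f
        val end = System.currentTimeMillis()
        end - begin
      }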
      val sc = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
      val hiveContext = new HiveContext(sc)
      import hiveContext._
      sql("drop table if exists left_table")
      sql("drop table if exists right_table")
      sql( """create table left_table (key int, value string)
           """.stripMargin)
      sql( s"""load data local inpath "/data1/kv3.txt" into table left_table""")
      sql( """create table right_table (key int, value string)
           """.stripMargin)
      sql(
        """
          |from left_table
          |insert overwrite table right_table
          |select left_table.key, left_table.value
        """.stripMargin)
    
      val leftSemiJoin = sql(
        """select a.key from left_table a
          |left semi join right_table b on a.key = b.key""".stripMargin)
      val leftSemiJoinDuration = benchmark(leftSemiJoin.count())
      println(s"left semi join : $leftSemiJoinDuration ms ")
    </code></pre>
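
The benchmark above times a single run, so the figure includes JVM warm-up. A possible variation, shown only as a sketch, discards a few warm-up runs and reports the median of several timed runs. `BenchmarkSketch`, `timeMs` and `medianMs` are hypothetical names and are not part of this pull request.

```scala
object BenchmarkSketch {
  // Time one evaluation of `f` in milliseconds using the monotonic clock.
  def timeMs(f: => Unit): Long = {
    val begin = System.nanoTime()
    f
    (System.nanoTime() - begin) / 1000000L
  }

  // Run `f` `warmups` times untimed, then `runs` times timed, and return the
  // median duration (the upper median when `runs` is even).
  def medianMs(warmups: Int, runs: Int)(f: => Unit): Long = {
    require(runs > 0, "runs must be positive")
    (1 to warmups).foreach(_ => f)
    val sorted = Seq.fill(runs)(timeMs(f)).sorted
    sorted(sorted.length / 2)
  }
}
```

It would be called the same way as `benchmark`, for example by passing the query's `count()` call as the by-name argument.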

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangxiaojing/spark SPARK-4570

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3442
    
----
commit 5d58772aa0bd7fd55a9b9495efbff5cc0b36aeae
Author: wangxiaojing <[email protected]>
Date:   2014-11-25T04:04:05Z

    add BroadcastLeftSemiJoinHash

----

