GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/758
[SPARK-1368][SQL] Optimized HiveTableScan
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
This PR introduces two major updates:
- Replaced FP-style code with a `while` loop and a reusable `GenericMutableRow`
object in the critical path of `HiveTableScan` (sketched below).
- Used `ColumnProjectionUtils` to help optimize RCFile and ORC column
pruning (see the second sketch below).
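A minimal sketch of the first idea, not the PR's actual code: the `GenericMutableRow` stand-in class, `toRows`, and `records` are hypothetical names used only for illustration. The point is that one mutable row is allocated up front and overwritten in a tight `while` loop, instead of allocating a fresh row per record through FP-style combinators.
```scala
// Stand-in for Catalyst's GenericMutableRow, just enough for this sketch.
class GenericMutableRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

// Hypothetical deserialization loop: one row object, reused for every record.
def toRows(records: Iterator[Array[Any]], fieldCount: Int): Iterator[GenericMutableRow] = {
  val mutableRow = new GenericMutableRow(fieldCount) // allocated once
  records.map { record =>
    var i = 0
    while (i < fieldCount) { // no per-field closures or intermediate collections
      mutableRow(i) = record(i)
      i += 1
    }
    mutableRow // the same object is handed back for every record
  }
}
```
For the second idea, a hedged sketch assuming the Hive 0.12-era `ColumnProjectionUtils` API (`appendReadColumnIDs`): registering the IDs of the columns a query actually reads lets columnar input formats such as RCFile and ORC skip deserializing the rest.
```scala
import java.util.{ArrayList => JArrayList}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

val hiveConf = new HiveConf()
val neededColumnIDs = new JArrayList[Integer]()
neededColumnIDs.add(0) // e.g. only `key` is selected by the query

// Columnar formats consult this setting and skip the unlisted columns.
ColumnProjectionUtils.appendReadColumnIDs(hiveConf, neededColumnIDs)
```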
My quick micro-benchmark suggests these two optimizations make the scan
about 2x faster for a CSV table and about 2.5x faster for an RCFile table:
```
Original:
[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms
Optimized:
[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```
The micro-benchmark loads a 609MB CSV file (structurally similar to the
`src` test table) into a normal Hive table with `LazySimpleSerDe` and into an
RCFile table, then scans each table.
Preparation code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```
Benchmark code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```
@marmbrus Please help review.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark fastHiveTableScan
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/758.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #758
----
commit 964087fd96c5f5034d79988a0d7d76733561b610
Author: Cheng Lian <[email protected]>
Date: 2014-05-11T06:41:42Z
[SPARK-1368] Optimized HiveTableScan
commit a3c272b04852bb5135847504d0ad1258fd583ec1
Author: Cheng Lian <[email protected]>
Date: 2014-05-13T16:33:06Z
Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
----