GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/758
[SPARK-1368][SQL] Optimized HiveTableScan
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
This PR introduces two major updates:
- Replaced FP-style code with a `while` loop and a reusable `GenericMutableRow`
object in the critical path of `HiveTableScan` (sketched below).
- Used `ColumnProjectionUtils` to help optimize RCFile and ORC column
pruning (see the second sketch below).
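A minimal sketch of the first idea, not the PR's actual code: the `GenericMutableRow` stand-in class, `toRows`, and `records` are hypothetical names used only for illustration. The point is that one mutable row is allocated up front and overwritten in a tight `while` loop, instead of allocating a fresh row per record through FP-style combinators.
```scala
// Stand-in for Catalyst's GenericMutableRow, just enough for this sketch.
class GenericMutableRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

// Hypothetical deserialization loop: one row object, reused for every record.
def toRows(records: Iterator[Array[Any]], fieldCount: Int): Iterator[GenericMutableRow] = {
  val mutableRow = new GenericMutableRow(fieldCount) // allocated once
  records.map { record =>
    var i = 0
    while (i < fieldCount) { // no per-field closures or intermediate collections
      mutableRow(i) = record(i)
      i += 1
    }
    mutableRow // the same object is handed back for every record
  }
}
```
For the second idea, a hedged sketch assuming the Hive 0.12-era `ColumnProjectionUtils` API (`appendReadColumnIDs`): registering the IDs of the columns a query actually reads lets columnar input formats such as RCFile and ORC skip deserializing the rest.
```scala
import java.util.{ArrayList => JArrayList}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

val hiveConf = new HiveConf()
val neededColumnIDs = new JArrayList[Integer]()
neededColumnIDs.add(0) // e.g. only `key` is selected by the query

// Columnar formats consult this setting and skip the unlisted columns.
ColumnProjectionUtils.appendReadColumnIDs(hiveConf, neededColumnIDs)
```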
My quick micro-benchmark suggests these two optimizations make the scan
about 2x faster for a CSV table and about 2.5x faster for an RCFile table:
```
Original:
[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms
Optimized:
[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```
The micro-benchmark loads a 609MB CSV file (structurally similar to the
`src` test table) into a normal Hive table with `LazySimpleSerDe` and into an
RCFile table, then scans each table.
Preparation code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```
Benchmark code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```
@marmbrus Please help review.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark fastHiveTableScan
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/758.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #758
----
commit 964087fd96c5f5034d79988a0d7d76733561b610
Author: Cheng Lian <[email protected]>
Date: 2014-05-11T06:41:42Z
[SPARK-1368] Optimized HiveTableScan
commit a3c272b04852bb5135847504d0ad1258fd583ec1
Author: Cheng Lian <[email protected]>
Date: 2014-05-13T16:33:06Z
Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
----