GitHub user concretevitamin opened a pull request:
https://github.com/apache/spark/pull/1390
[SPARK-2443][SQL] Fix slow read from partitioned tables
This simply incorporates Shark's
[#329](https://github.com/amplab/shark/pull/329) into Spark SQL. Implementation
credit to @chiragaggarwal.
@marmbrus @rxin @chenghao-intel
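As background on what a fix like this generally targets: work that is invariant within a partition (deserializer/ObjectInspector setup, partition-key value conversion) should be hoisted out of the per-row path. The following is a purely illustrative Scala sketch of that pattern, not the actual diff; see the linked Shark and Spark PRs for the real change:

```scala
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector

// Slow: per-partition-invariant lookups are repeated for every row.
def toFieldsSlow(rows: Iterator[AnyRef], oi: StructObjectInspector): Iterator[Seq[AnyRef]] =
  rows.map { raw =>
    val fieldRefs = oi.getAllStructFieldRefs            // recomputed per row
    (0 until fieldRefs.size).map(i => oi.getStructFieldData(raw, fieldRefs.get(i)))
  }

// Fast: invariant setup happens once per partition; only cheap work stays in the loop.
def toFieldsFast(rows: Iterator[AnyRef], oi: StructObjectInspector): Iterator[Seq[AnyRef]] = {
  val fieldRefs = oi.getAllStructFieldRefs               // computed once
  rows.map { raw =>
    (0 until fieldRefs.size).map(i => oi.getStructFieldData(raw, fieldRefs.get(i)))
  }
}
```

With 10M rows, even cheap work repeated per row adds up, which is consistent with the slowdown appearing only in the partitioned case in the benchmarks below.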
## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The
data was loaded as a table through Hive, and results were obtained on my local
machine using hive/console.
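For reference, a minimal sketch of how a setup like this can be reproduced from hive/console; the file path, table names, and queries are assumptions, and depending on the Spark version HiveQL is issued through `sql(...)` or `hql(...)`:

```scala
// Assumed names and paths; hive/console pre-imports TestHive's methods, shown here
// explicitly so the sketch is self-contained.
import org.apache.spark.sql.hive.test.TestHive._

// Flat table over the 10M-row key-value file.
sql("CREATE TABLE kv_flat (key INT, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
sql("LOAD DATA LOCAL INPATH '/tmp/kv_10m.txt' INTO TABLE kv_flat")

// The same data as a table with a single partition.
sql("CREATE TABLE kv_part (key INT, value STRING) PARTITIONED BY (p INT) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
sql("INSERT OVERWRITE TABLE kv_part PARTITION (p = 1) SELECT key, value FROM kv_flat")

// Crude end-to-end timer; run each query several times to get stabilized numbers.
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

time { sql("SELECT COUNT(*) FROM kv_flat").collect() }  // non-partitioned scan
time { sql("SELECT COUNT(*) FROM kv_part").collect() }  // partitioned (1 partition) scan
```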
Without the fix:
Non-partitioned | Partitioned (1 partition)
------------ | -------------
First run: 9.52s end-to-end (1.64s Spark job) | First run: 36.6s (28.3s)
Stabilized runs: 1.21s (1.18s) | Stabilized runs: 27.6s (27.5s)
With this fix:
Non-partitioned | Partitioned (1 partition)
------------ | -------------
First run: 9.57s (1.46s) | First run: 9.30s (1.45s)
Stabilized runs: 1.13s (1.10s) | Stabilized runs: 1.18s (1.15s)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/concretevitamin/spark slow-read
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1390.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1390
----
commit 403f460c644308126b6f3ab5dda66fa6b1872ce9
Author: Zongheng Yang <[email protected]>
Date: 2014-07-12T22:52:47Z
Incorporate shark/pull/329 into Spark SQL.
Credit to @chiragaggarwal.
----