GitHub user concretevitamin opened a pull request:

    https://github.com/apache/spark/pull/1390

    [SPARK-2443][SQL] Fix slow read from partitioned tables 

    This simply incorporates Shark's [#329](https://github.com/amplab/shark/pull/329) into Spark SQL. Implementation credit to @chiragaggarwal.
    
    @marmbrus @rxin @chenghao-intel
    
    ## Benchmarks
    Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.
    
    Without the fix:
    
    Non-partitioned | Partitioned (1 part)
    ------------ | -------------
    First run: 9.52s end-to-end (1.64s Spark job) | First run: 36.6s (28.3s)
    Stabilized runs: 1.21s (1.18s) | Stabilized runs: 27.6s (27.5s)
    
    With this fix:
    
    Non-partitioned | Partitioned (1 part)
    ------------ | -------------
    First run: 9.57s (1.46s) | First run: 9.30s (1.45s)
    Stabilized runs: 1.13s (1.10s) | Stabilized runs: 1.18s (1.15s)
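    
    For reference, a minimal sketch of how a benchmark like this could be set up from hive/console. The file path, table names, and the `time` helper are hypothetical, and the exact HiveQL entry point (`hql` vs. `sql`) depends on the Spark SQL version; this is not the exact script used for the numbers above.
    
    ```scala
    // Sketch only: generate ~10M key-value rows, load them into a non-partitioned
    // and a single-partition Hive table, and time a full scan of each.
    import java.io.{BufferedWriter, FileWriter}

    // 1. Write a simple tab-separated key-value text file (hypothetical path).
    val writer = new BufferedWriter(new FileWriter("/tmp/kv_10m.txt"))
    (1 to 10000000).foreach { i => writer.write(s"$i\tval_$i\n") }
    writer.close()

    // 2. Create the tables through Hive and load the data. Assumes hive/console
    //    exposes a HiveContext-backed `hql` method, as in this era of Spark SQL.
    hql("CREATE TABLE kv_flat (key INT, value STRING) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    hql("LOAD DATA LOCAL INPATH '/tmp/kv_10m.txt' INTO TABLE kv_flat")

    hql("CREATE TABLE kv_part (key INT, value STRING) PARTITIONED BY (p INT) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    hql("LOAD DATA LOCAL INPATH '/tmp/kv_10m.txt' INTO TABLE kv_part PARTITION (p = 1)")

    // 3. Rough end-to-end timing helper.
    def time[T](body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"Elapsed: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    time { hql("SELECT COUNT(*) FROM kv_flat").collect() }  // non-partitioned scan
    time { hql("SELECT COUNT(*) FROM kv_part").collect() }  // partitioned (1 part) scan
    ```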
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/concretevitamin/spark slow-read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1390.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1390
    
----
commit 403f460c644308126b6f3ab5dda66fa6b1872ce9
Author: Zongheng Yang <[email protected]>
Date:   2014-07-12T22:52:47Z

    Incorporate shark/pull/329 into Spark SQL.
    
    Credit to @chiragaggarwal.

----


