GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/12095

    [SPARK-14259] [SQL] Merging small files together based on the cost of opening

    ## What changes were proposed in this pull request?
    
    This PR re-does the work in #12068 but with a different cost model, which 
should work better when the small files have varying sizes.
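    The idea can be sketched as a greedy packing in which each file is charged its size plus a fixed cost for opening it, so that many tiny files are spread over several partitions instead of collapsing into one. The sketch below is illustrative only, not the code in this PR: the function name `pack_files` is hypothetical, and the parameter names merely echo the Spark configs `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`.

```python
# Illustrative sketch (not the actual Spark code) of cost-based packing of
# small files into read partitions. Each file is charged its size plus a
# fixed "cost of opening", so tiny files still consume partition budget.
def pack_files(file_sizes, max_partition_bytes, open_cost_in_bytes):
    """Greedy first-fit-decreasing packing by estimated cost."""
    partitions = []
    current, current_cost = [], 0
    # Place large files first so partitions come out roughly balanced.
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost_in_bytes
        if current and current_cost + cost > max_partition_bytes:
            # Current partition is full: start a new one.
            partitions.append(current)
            current, current_cost = [], 0
        current.append(size)
        current_cost += cost
    if current:
        partitions.append(current)
    return partitions
```

    When the open cost is overestimated, each tiny file eats a noticeable share of the partition budget, so partitions hold fewer files and the resulting tasks shrink, which matches the behavior described under testing below.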
    
    ## How was this patch tested?
    
    Updated existing tests.
    
    Ran a query over thousands of small partitioned files locally with all 
default settings (so the cost to open a file should be overestimated); task 
durations became progressively smaller, which is good (the last few tasks 
are the shortest).
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark file_cost

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12095
    
----
commit d2e28cbc11734076d9fce187ed425fa64f9b3e36
Author: Davies Liu <[email protected]>
Date:   2016-03-31T19:57:33Z

    Merging the files based on cost

----


