GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/12095
[SPARK-14259] [SQL] Merging small files together based on the cost of opening
## What changes were proposed in this pull request?
This PR redoes the work in #12068 with a different cost model, which should
work better when the small files have varying sizes.
## How was this patch tested?
Updated existing tests.
Ran a query on thousands of small partitioned files locally, with all
default settings (so the cost of opening a file should be overestimated).
Task durations decrease steadily across the stage, which is the desired
behavior (the last few tasks are the shortest).
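The model can be sketched as a greedy packing pass: each file is charged its size plus a fixed cost for opening it, and files are grouped into tasks until a task's total charged cost reaches a cap. This is a minimal illustration, not the actual Spark implementation; the parameter names `open_cost_bytes` and `max_split_bytes` are assumptions for this sketch.

```python
def pack_files(file_sizes, max_split_bytes=128 * 1024 * 1024,
               open_cost_bytes=4 * 1024 * 1024):
    """Greedily group files into tasks, charging each file its size plus a
    fixed opening cost so that many tiny files don't all land in one task.

    Illustrative sketch only; parameter names are assumptions, not Spark's
    configuration keys.
    """
    partitions = []          # list of tasks, each a list of file sizes
    current, current_cost = [], 0
    # Packing larger files first tends to leave the smallest (fastest)
    # tasks for last, matching the behavior described above.
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost_bytes
        if current and current_cost + cost > max_split_bytes:
            # Current task is full: close it and start a new one.
            partitions.append(current)
            current, current_cost = [], 0
        current.append(size)
        current_cost += cost
    if current:
        partitions.append(current)
    return partitions
```

With an overestimated opening cost, even zero-byte files still contribute to a task's charged cost, so no single task accumulates an unbounded number of small files.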
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark file_cost
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12095.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12095
----
commit d2e28cbc11734076d9fce187ed425fa64f9b3e36
Author: Davies Liu <[email protected]>
Date: 2016-03-31T19:57:33Z
Merging the files based on cost
----