GitHub user watermen opened a pull request:

    https://github.com/apache/spark/pull/8125

    [SPARK-8813][SQL] Combine files when there're many small files in table

    Many small files in a table lead to many small tasks in Spark, which
hurts performance. This patch uses the `coalesce` operator to combine small
files. It covers `TableReader` and
`HadoopFsRelation` (`ParquetRelation`/`OrcRelation`/`JSONRelation`).
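As a rough illustration of the idea (not the patch's actual code), one way to pick how many partitions to coalesce to is to divide the total input size by the target split size and round up. The names `SmallFileCombine` and `targetPartitions` are hypothetical and introduced here only for the sketch, and sizes are assumed to be in bytes.

```scala
// Hypothetical helper, not taken from the patch: estimate how many
// partitions to coalesce to for a given target split size (in bytes).
object SmallFileCombine {
  def targetPartitions(fileSizes: Seq[Long], splitSize: Long): Int = {
    require(splitSize > 0, "splitSize must be positive")
    val totalBytes = fileSizes.sum
    // Ceiling division; always keep at least one partition.
    math.max(1L, (totalBytes + splitSize - 1) / splitSize).toInt
  }
}
```

Under these assumptions, 1000 files of 1000000 bytes each with a split size of 256000000 give 4 partitions, rather than one task per file.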
    Many users run SQL in the `spark-sql` shell, where the workaround
    ```
    sqlContext.read.parquet("hdfs://some/path").coalesce(1).collect()
    ```
    advised by @liancheng is not convenient.
    This patch adds the two configurations below (if that is OK, I'll add them to the docs):
    
    | Property Name | Default | Meaning |
    | ------------- | ------- | ------- |
    | spark.sql.small.file.combine | true | Whether to combine small files. |
    | spark.sql.small.file.split.size | 256000000 | The size of each split after combining small files. |
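If the patch were applied, the two properties could presumably be set like any other Spark SQL configuration. The sketch below assumes an existing `SQLContext`; the wrapper object `SmallFileConf`, its method name, and the example split size of 128000000 are hypothetical, and the property names only exist with this patch.

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical wrapper: the two property names come from this pull
// request and are not available in Spark without the patch.
object SmallFileConf {
  def enableSmallFileCombine(sqlContext: SQLContext): Unit = {
    // Turn on combining of small files (the proposed default is already true).
    sqlContext.setConf("spark.sql.small.file.combine", "true")
    // Example value only: use a smaller split than the proposed default.
    sqlContext.setConf("spark.sql.small.file.split.size", "128000000")
  }
}
```

In the `spark-sql` shell, the same properties would be set with `SET key=value` statements.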
    
    /cc @liancheng @marmbrus @scwf 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/watermen/spark SPARK-8813-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8125
    
----
commit 3ae34b46edc482065ba94471d73fb649b5b72785
Author: Yadong Qi <[email protected]>
Date:   2015-08-12T09:52:01Z

    Combine small files.

----


