GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/8346

    [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as 
the min split size if necessary.

    https://issues.apache.org/jira/browse/SPARK-10143
    
    I tested this locally with a 343MB table stored on my local FS. Because I 
did not set a min or max split size, the default split size was 32MB and the 
map stage had 11 tasks, but only three of those tasks actually read data. With 
this PR, the map stage has only three tasks. Here is the difference.
    
    Without this PR:
    
![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)
    
    With this PR:
    
![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)
    
    Even if the block size setting does not match the actual block size of the 
parquet file, I think it is still generally good to use parquet's block size 
setting as the min split size when the configured min split size is smaller 
than this block size.
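
    The task counts above can be reproduced with a rough sketch of the split arithmetic. This is not the code in this PR. It assumes Hadoop's `FileInputFormat` rule (`max(minSize, min(maxSize, blockSize))` with a 1.1 slop factor on the last split) and assumes the table was written with Parquet's default 128MB row group size, which the description above does not state. All function names here are hypothetical.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # stands in for an unset max split size


def compute_split_size(block_size, min_size, max_size):
    # Same shape as Hadoop's FileInputFormat.computeSplitSize.
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size, slop=1.1):
    # Cut full-size splits while the remainder exceeds slop * split_size;
    # whatever is left becomes the final split.
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


def effective_min_split(configured_min, parquet_block_size):
    # The idea proposed here: raise the min split size to the Parquet
    # block (row group) size when the configured value is smaller.
    return max(configured_min, parquet_block_size)


file_size = 343 * MB        # table size from the description
fs_block_size = 32 * MB     # default split size observed on the local FS
row_group_size = 128 * MB   # ASSUMPTION: Parquet's default block size

before = count_splits(
    file_size, compute_split_size(fs_block_size, 1, LONG_MAX))
after = count_splits(
    file_size,
    compute_split_size(
        fs_block_size, effective_min_split(1, row_group_size), LONG_MAX))

print(before, after)  # 11 3
```

    Under these assumptions, 32MB splits give 11 tasks for a 343MB table, while a 128MB minimum gives 3, one per row group that a task can actually read.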

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark parquetMinSplit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8346
    
----
commit e460545f2976d40adaf54d609553b7399ec7c6a2
Author: Yin Huai <[email protected]>
Date:   2015-08-21T00:59:05Z

    Use parquet's block size (row group size) setting as the min split size if 
necessary.

----

