GitHub user piaozhexiu opened a pull request:

    https://github.com/apache/spark/pull/8156

    [SPARK-9926] [SQL] parallelize file listing for partitioned Hive table

    Instead of listing files one partition at a time, listing all the partitions 
together in parallel improves query performance quite a bit.
    
    Here are my benchmarks with 
mapreduce.input.fileinputformat.list-status.num-threads set to 25:
    * Parquet-backed Hive table
    * Partitioned by dateint and hour
    * Stored on S3
    
    # of files | # of partitions | before    | after    | improvement
    -----------|-----------------|-----------|----------|------------
    972        | 1               | 38 secs   | 20 secs  | 1.9x
    13646      | 24              | 354 secs  | 28 secs  | 12x
    136222     | 240             | 3507 secs | 156 secs | 22x
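    
    For reference, here is a minimal sketch of how that listing-thread setting 
can be applied from the driver. It assumes an existing SparkContext named `sc` 
and only illustrates the benchmark configuration above; it is not part of the 
patch.
    
        // Minimal sketch (assumes an existing SparkContext `sc`): raise the number
        // of threads Hadoop's FileInputFormat uses when listing input paths.
        sc.hadoopConfiguration.set(
          "mapreduce.input.fileinputformat.list-status.num-threads", "25")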
    
    The changes include:
    * In TableReader, compute input splits for all the partitions together and 
store them in a map (see the sketch below).
    * In HadoopRDD, first check whether input splits are already available for 
the input paths in that map and reuse them instead of listing files on a 
per-partition basis.
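
    A rough sketch of the parallel-listing idea is below. It is illustrative 
only: the helper name `listPartitionsInParallel` and the `Path -> 
Array[FileStatus]` map are assumptions for the example, not the names used in 
the patch.
    
        import java.util.concurrent.Executors
        
        import scala.concurrent.{Await, ExecutionContext, Future}
        import scala.concurrent.duration.Duration
        
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileStatus, Path}
        
        // List every partition directory with a shared fixed-size thread pool and
        // collect the results into one map keyed by partition path. A reader can
        // then look files up in this map instead of listing each partition again.
        def listPartitionsInParallel(
            partitionPaths: Seq[Path],
            conf: Configuration,
            numThreads: Int = 25): Map[Path, Array[FileStatus]] = {
          val pool = Executors.newFixedThreadPool(numThreads)
          implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
          try {
            val futures = partitionPaths.map { path =>
              Future {
                val fs = path.getFileSystem(conf)   // resolve the FileSystem (e.g. S3)
                path -> fs.listStatus(path)         // the expensive listing call
              }
            }
            Await.result(Future.sequence(futures), Duration.Inf).toMap
          } finally {
            pool.shutdown()
          }
        }
    
    With such a map in hand, the per-partition reader would do a simple lookup 
by input path and fall back to the regular per-partition listing only when the 
path is missing.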

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/piaozhexiu/spark SPARK-9926

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8156.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8156
    
----
commit af43f4721a6e32764197d74ed702f4cb4202f55e
Author: Cheolsoo Park <[email protected]>
Date:   2015-08-12T20:30:24Z

    Parallelize file listing for Hive tables

----


