GitHub user piaozhexiu opened a pull request:
https://github.com/apache/spark/pull/8156
[SPARK-9926] [SQL] parallelize file listing for partitioned Hive table
Listing files for all partitions together in parallel, instead of one partition
at a time, improves query performance quite a bit.
Here are my benchmarks with
mapreduce.input.fileinputformat.list-status.num-threads set to 25:
* Parquet-backed Hive table
* Partitioned by dateint and hour
* Stored on S3
| # of files | # of partitions | before | after | improvement |
|------------|-----------------|-----------|----------|-------------|
| 972 | 1 | 38 secs | 20 secs | 1.9x |
| 13646 | 24 | 354 secs | 28 secs | 12x |
| 136222 | 240 | 3507 secs | 156 secs | 22x |
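For reference, the listing-thread knob used in the benchmark above is a standard Hadoop property; one way to pass it through (the `spark.hadoop.` prefix forwards properties into the Hadoop Configuration) would be:

```shell
# Forward the Hadoop file-listing thread count into Spark's Hadoop
# Configuration; 25 matches the benchmark setup above.
spark-shell --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=25
```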
The changes include:
* In TableReader, compute input splits for all the partitions together and
store them in a map keyed by input path.
* In HadoopRDD, first check whether input splits are already available in that
map for the input paths, and reuse them instead of listing files on a
per-partition basis.
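The parallel-listing idea behind those two changes can be sketched as follows. This is a minimal illustration, not the actual TableReader code: `listPartition` is a hypothetical stand-in for the per-partition file listing that FileInputFormat performs against S3/HDFS, and the thread count plays the role of the num-threads setting above.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical stand-in for listing one partition directory's files.
def listPartition(path: String): Seq[String] =
  (1 to 3).map(i => s"$path/part-$i.parquet")

// List all partitions together on a fixed-size thread pool and cache the
// results in a map keyed by partition path, so a consumer (HadoopRDD in the
// real patch) can reuse them instead of listing each partition separately.
def listAllPartitions(paths: Seq[String], numThreads: Int): Map[String, Seq[String]] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = paths.map(p => Future(p -> listPartition(p)))
    Await.result(Future.sequence(futures), Duration.Inf).toMap
  } finally pool.shutdown()
}

val splits = listAllPartitions(
  Seq("s3://bucket/dateint=20150812/hour=0",
      "s3://bucket/dateint=20150812/hour=1"),
  numThreads = 2)
println(splits.size) // 2
```

Because each listing is an independent remote metadata call, the wall-clock win grows with the number of partitions, which matches the benchmark trend above.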
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/piaozhexiu/spark SPARK-9926
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8156.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8156
----
commit af43f4721a6e32764197d74ed702f4cb4202f55e
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-12T20:30:24Z
Parallelize file listing for Hive tables
----