GitHub user piaozhexiu opened a pull request:
https://github.com/apache/spark/pull/8512
[SPARK-10340] [SQL] Use S3 bulk listing for S3-backed Hive tables
This PR includes #8156 because SPARK-10340 depends on SPARK-9926.
When `spark.sql.hive.s3BulkListing` is set to true, file listing for
partitioned Hive tables stored on S3 is done with
`AmazonS3Client.listNextBatchOfObjects` instead of the Hadoop `FileSystem` API (i.e. S3N).
The property defaults to false.
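Assuming the property name introduced by this PR, the flag could be flipped per session, e.g.:

```
-- hypothetical session-level override; the property defaults to false
SET spark.sql.hive.s3BulkListing=true;
```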
Here are my benchmark results for the following queries:
```
1 partition:    select * from nccp_log where dateint=20150601 and hour=0 limit 10;
24 partitions:  select * from nccp_log where dateint=20150601 limit 10;
240 partitions: select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;
720 partitions: select * from nccp_log where dateint>=20150601 and dateint<=20150630 limit 10;
```
| # of files | # of partitions | current (a) | SPARK-9926 / 10 threads (b) | SPARK-10340 (c) |
|-----------:|----------------:|------------:|----------------------------:|----------------:|
| 972        | 1               | 38s         | 36s                         | 27s             |
| 13646      | 24              | 354s        | 41s                         | 30s             |
| 136222     | 240             | 1h          | 321s                        | 195s            |
| 445377     | 720             | 3h          | 1194s                       | 274s            |

All of the S3 bulk listing logic is implemented in the `SparkS3Util` object. The code is
mostly based on
[PrestoS3FileSystem](https://github.com/facebook/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java)
and
[FileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html).
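The core idea — one paged listing call per batch of keys instead of one `FileSystem` call per partition directory — can be sketched as follows. This is a minimal illustration, not the PR's actual `SparkS3Util` code: the `Listing` and `StubS3Client` classes below are in-memory stand-ins for the AWS SDK's `ObjectListing` and `AmazonS3Client` (the method names `listObjects` / `listNextBatchOfObjects` / `isTruncated` mirror the real SDK, but everything else here is hypothetical).

```java
import java.util.ArrayList;
import java.util.List;

public class BulkListSketch {

    /** Minimal stand-in for the AWS SDK's ObjectListing. */
    static final class Listing {
        final List<String> keys;
        final int nextOffset; // -1 means no more pages
        Listing(List<String> keys, int nextOffset) {
            this.keys = keys;
            this.nextOffset = nextOffset;
        }
        boolean isTruncated() { return nextOffset >= 0; }
    }

    /** Stub client that returns matches in fixed-size pages, as S3 does. */
    static final class StubS3Client {
        private final List<String> allKeys;
        private final int pageSize;
        StubS3Client(List<String> allKeys, int pageSize) {
            this.allKeys = allKeys;
            this.pageSize = pageSize;
        }
        Listing listObjects(String prefix) { return page(prefix, 0); }
        Listing listNextBatchOfObjects(Listing prev, String prefix) {
            return page(prefix, prev.nextOffset);
        }
        private Listing page(String prefix, int offset) {
            List<String> matched = new ArrayList<>();
            for (String k : allKeys) {
                if (k.startsWith(prefix)) matched.add(k);
            }
            int end = Math.min(offset + pageSize, matched.size());
            List<String> pageKeys = new ArrayList<>(matched.subList(offset, end));
            int next = end < matched.size() ? end : -1;
            return new Listing(pageKeys, next);
        }
    }

    /**
     * The paging loop: keep fetching batches while the listing is truncated.
     * Listing a whole table prefix this way avoids a per-partition-directory
     * round trip, which is where the speedup in the benchmark comes from.
     */
    static List<String> listAllKeys(StubS3Client s3, String prefix) {
        List<String> result = new ArrayList<>();
        Listing listing = s3.listObjects(prefix);
        result.addAll(listing.keys);
        while (listing.isTruncated()) {
            listing = s3.listNextBatchOfObjects(listing, prefix);
            result.addAll(listing.keys);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 partition directories x 4 files, listed in pages of up to 5 keys.
        List<String> keys = new ArrayList<>();
        for (int p = 1; p <= 3; p++) {
            for (int f = 0; f < 4; f++) {
                keys.add("warehouse/nccp_log/dateint=2015060" + p + "/part-" + f);
            }
        }
        StubS3Client s3 = new StubS3Client(keys, 5);
        List<String> listed = listAllKeys(s3, "warehouse/nccp_log/");
        System.out.println(listed.size()); // prints 12 (fetched in 3 pages)
    }
}
```

With the real SDK, the same loop drives `AmazonS3Client` directly; the stub only exists so the shape of the batching is visible without AWS credentials.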
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/piaozhexiu/spark SPARK-10340
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8512.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8512
----
commit f9a17c59d61ba9f63f33fb6dca531edbb9c7e4e8
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-12T20:30:24Z
Parallelize file listing for Hive tables
commit 09dfbba076b5ee389dda8aab558e93719d148cc0
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-13T02:14:05Z
Fix style error for 100 char limit
commit 5c2aecf844960a7d2150de88076d1d55cff83e9c
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-13T15:47:02Z
Fix unit test failures
commit acdc184bec3d88bb4393ca1eb2a9482d107c38f0
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-17T17:22:12Z
Replace central cache in SparkHadoopUtil with local cache in HadoopRDD
commit 982f0dd08eb05ab47a5e5fe187a3853c8ab48f55
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-28T17:59:47Z
Implement s3 bulk listing for partitioned hive table
----