GitHub user piaozhexiu opened a pull request:
https://github.com/apache/spark/pull/8512
[SPARK-10340] [SQL] Use S3 bulk listing for S3-backed Hive tables
This PR includes #8156 because SPARK-10340 depends on SPARK-9926.
When `spark.sql.hive.s3BulkListing` is set to true, file listing for
partitioned Hive tables stored on S3 is done with
`AmazonS3Client.listNextBatchOfObjects` instead of the Hadoop `FileSystem` API (i.e. S3N).
The property defaults to false.
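Assuming the property name introduced by this PR, the flag could be flipped per session, e.g.:

```
-- hypothetical session-level override; the property defaults to false
SET spark.sql.hive.s3BulkListing=true;
```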
Here are my benchmark results for the following queries:
```
1 partition:    select * from nccp_log where dateint=20150601 and hour=0 limit 10;
24 partitions:  select * from nccp_log where dateint=20150601 limit 10;
240 partitions: select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;
720 partitions: select * from nccp_log where dateint>=20150601 and dateint<=20150630 limit 10;
```
| # of files | # of partitions | current (a) | SPARK-9926 / 10 threads (b) | SPARK-10340 (c) |
|-----------:|----------------:|------------:|----------------------------:|----------------:|
| 972        | 1               | 38s         | 36s                         | 27s             |
| 13646      | 24              | 354s        | 41s                         | 30s             |
| 136222     | 240             | 1h          | 321s                        | 195s            |
| 445377     | 720             | 3h          | 1194s                       | 274s            |

All of the S3 bulk listing logic is implemented in the `SparkS3Util` object. The code is
mostly based on
[PrestoS3FileSystem](https://github.com/facebook/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java)
and
[FileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html).
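The core idea — one paged listing call per batch of keys instead of one `FileSystem` call per partition directory — can be sketched as follows. This is a minimal illustration, not the PR's actual `SparkS3Util` code: the `Listing` and `StubS3Client` classes below are in-memory stand-ins for the AWS SDK's `ObjectListing` and `AmazonS3Client` (the method names `listObjects` / `listNextBatchOfObjects` / `isTruncated` mirror the real SDK, but everything else here is hypothetical).

```java
import java.util.ArrayList;
import java.util.List;

public class BulkListSketch {

    /** Minimal stand-in for the AWS SDK's ObjectListing. */
    static final class Listing {
        final List<String> keys;
        final int nextOffset; // -1 means no more pages
        Listing(List<String> keys, int nextOffset) {
            this.keys = keys;
            this.nextOffset = nextOffset;
        }
        boolean isTruncated() { return nextOffset >= 0; }
    }

    /** Stub client that returns matches in fixed-size pages, as S3 does. */
    static final class StubS3Client {
        private final List<String> allKeys;
        private final int pageSize;
        StubS3Client(List<String> allKeys, int pageSize) {
            this.allKeys = allKeys;
            this.pageSize = pageSize;
        }
        Listing listObjects(String prefix) { return page(prefix, 0); }
        Listing listNextBatchOfObjects(Listing prev, String prefix) {
            return page(prefix, prev.nextOffset);
        }
        private Listing page(String prefix, int offset) {
            List<String> matched = new ArrayList<>();
            for (String k : allKeys) {
                if (k.startsWith(prefix)) matched.add(k);
            }
            int end = Math.min(offset + pageSize, matched.size());
            List<String> pageKeys = new ArrayList<>(matched.subList(offset, end));
            int next = end < matched.size() ? end : -1;
            return new Listing(pageKeys, next);
        }
    }

    /**
     * The paging loop: keep fetching batches while the listing is truncated.
     * Listing a whole table prefix this way avoids a per-partition-directory
     * round trip, which is where the speedup in the benchmark comes from.
     */
    static List<String> listAllKeys(StubS3Client s3, String prefix) {
        List<String> result = new ArrayList<>();
        Listing listing = s3.listObjects(prefix);
        result.addAll(listing.keys);
        while (listing.isTruncated()) {
            listing = s3.listNextBatchOfObjects(listing, prefix);
            result.addAll(listing.keys);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 partition directories x 4 files, listed in pages of up to 5 keys.
        List<String> keys = new ArrayList<>();
        for (int p = 1; p <= 3; p++) {
            for (int f = 0; f < 4; f++) {
                keys.add("warehouse/nccp_log/dateint=2015060" + p + "/part-" + f);
            }
        }
        StubS3Client s3 = new StubS3Client(keys, 5);
        List<String> listed = listAllKeys(s3, "warehouse/nccp_log/");
        System.out.println(listed.size()); // prints 12 (fetched in 3 pages)
    }
}
```

With the real SDK, the same loop drives `AmazonS3Client` directly; the stub only exists so the shape of the batching is visible without AWS credentials.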
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/piaozhexiu/spark SPARK-10340
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8512.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8512
----
commit f9a17c59d61ba9f63f33fb6dca531edbb9c7e4e8
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-12T20:30:24Z
Parallelize file listing for Hive tables
commit 09dfbba076b5ee389dda8aab558e93719d148cc0
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-13T02:14:05Z
Fix style error for 100 char limit
commit 5c2aecf844960a7d2150de88076d1d55cff83e9c
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-13T15:47:02Z
Fix unit test failures
commit acdc184bec3d88bb4393ca1eb2a9482d107c38f0
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-17T17:22:12Z
Replace central cache in SparkHadoopUtil with local cache in HadoopRDD
commit 982f0dd08eb05ab47a5e5fe187a3853c8ab48f55
Author: Cheolsoo Park <[email protected]>
Date: 2015-08-28T17:59:47Z
Implement s3 bulk listing for partitioned hive table
----