[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-10340:
-------------------------------------
Target Version/s: (was: 1.6.0)
> Use S3 bulk listing for S3-backed Hive tables
> ---------------------------------------------
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.1, 1.5.0
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input
> paths as a parameter and returns all objects whose keys start with that
> prefix, in blocks of 1000.
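
To make the listing behaviour described here concrete, a minimal Python sketch follows. It does not call AWS: `list_page` is a hypothetical in-memory stand-in for a single listing request, and the key layout (`warehouse/t/dt=...`) is invented for illustration. It also shows one consequence of listing by common prefix, namely that keys from partitions that were not requested can come back and have to be filtered out afterwards.

```python
import os
from typing import Iterator, List, Optional, Tuple

PAGE_SIZE = 1000  # block size stated in the issue description


def common_prefix(paths: List[str]) -> str:
    """Longest leading string shared by all input paths (character-wise)."""
    return os.path.commonprefix(paths)


def list_page(store: List[str], prefix: str,
              marker: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Hypothetical stand-in for one listing request.

    Returns up to PAGE_SIZE keys under `prefix` that sort after `marker`,
    plus the marker for the next request (None when nothing is left).
    """
    matching = [k for k in sorted(store)
                if k.startswith(prefix) and (marker is None or k > marker)]
    page = matching[:PAGE_SIZE]
    next_marker = page[-1] if len(matching) > PAGE_SIZE else None
    return page, next_marker


def bulk_list(store: List[str], prefix: str) -> Iterator[str]:
    """Page through every key under `prefix`, one block at a time."""
    marker = None
    while True:
        page, marker = list_page(store, prefix, marker)
        yield from page
        if marker is None:
            break


# Invented example data: three partitions of 700 files each, plus another table.
store = [f"warehouse/t/dt=2015-08-0{d}/part-{i:05d}"
         for d in (1, 2, 3) for i in range(700)]
store += ["warehouse/other/part-00000"]

inputs = ["warehouse/t/dt=2015-08-01/", "warehouse/t/dt=2015-08-02/"]
prefix = common_prefix(inputs)
listed = list(bulk_list(store, prefix))

# The common prefix also matches dt=2015-08-03, so keep only requested paths.
wanted = [k for k in listed if k.startswith(tuple(inputs))]
```

In this made-up layout the two input paths share the prefix `warehouse/t/dt=2015-08-0`, so the listing covers a third partition as well. How much over-listing is acceptable depends on how tightly the input paths cluster under their common prefix.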
> Since SPARK-9926 allows us to list multiple partitions together, we can
> significantly speed up input split calculation using S3 bulk listing. This
> optimization is particularly useful for queries like {{select * from
> partitioned_table limit 10}}.
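
A rough back-of-envelope sketch of where the speed-up could come from, counting listing requests only. It assumes (this is not stated in the issue) that without bulk listing every partition directory costs at least one request of its own, and that the common prefix matches no objects outside the requested partitions. The partition and file counts are invented.

```python
import math
from typing import List

PAGE_SIZE = 1000  # block size stated in the issue description


def per_partition_requests(files_per_partition: List[int]) -> int:
    """Requests needed when each partition is listed separately (assumption)."""
    return sum(max(1, math.ceil(n / PAGE_SIZE)) for n in files_per_partition)


def bulk_requests(files_per_partition: List[int]) -> int:
    """Requests needed when all partitions are listed under one common prefix."""
    return max(1, math.ceil(sum(files_per_partition) / PAGE_SIZE))


# Invented example: 5000 partitions holding 3 files each.
partitions = [3] * 5000
separate = per_partition_requests(partitions)
bulk = bulk_requests(partitions)
```

Under these assumptions the table needs 5000 requests when listed partition by partition and 15 when listed in bulk. The gap is widest for tables with many small partitions, which is the case where split calculation is dominated by listing overhead.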
> This is a common optimization for S3. For example, here is a [blog
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/]
> from Qubole on this topic.