[
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150799#comment-15150799
]
Ryan Blue commented on SPARK-10340:
-----------------------------------
From discussion on the pull request, it looks like the solution is to add
parallelism to UnionRDD rather than use the Qubole approach of listing a
common prefix and filtering. I'm also linking to HADOOP-12810, which fixes a
problem in FileSystem that was causing a FileStatus to be fetched for each
file. The performance numbers show that both fixes address the performance
problem without adding an alternate code path that avoids delegating to the
InputFormat for split calculations on S3. I think the way forward is to get
the UnionRDD patch committed and backport the fix for FileSystem.
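
To illustrate the idea of adding parallelism to a union, here is a minimal Python sketch. It is not the UnionRDD patch itself: `union_partitions`, the `children` callables, and the `parallelism` parameter are hypothetical names, and each callable merely stands in for a child whose partition listing is slow (for example, one S3 directory per Hive partition).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def union_partitions(children: Sequence[Callable[[], List[str]]],
                     parallelism: int = 8) -> List[str]:
    """Collect the partitions of every child, listing children concurrently.

    Each child is a zero-argument callable (a hypothetical stand-in for an
    RDD) that returns its own list of partitions.
    """
    if not children:
        return []
    workers = max(1, min(parallelism, len(children)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so the combined list is the
        # same as the one a serial loop over the children would produce.
        per_child = list(pool.map(lambda child: child(), children))
    # Flatten the per-child lists into one list of partitions.
    return [part for parts in per_child for part in parts]
```

Because each child still computes its own partitions, this keeps split calculation delegated to the child rather than introducing a separate listing code path; only the waiting is overlapped.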
> Use S3 bulk listing for S3-backed Hive tables
> ---------------------------------------------
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.1, 1.5.0
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input
> paths as a parameter and returns all the objects whose keys start with
> the common prefix in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can
> significantly speed up input split calculation using S3 bulk listing. This
> optimization is particularly useful for queries like {{select * from
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/]
> from Qubole on this topic.
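
The bulk-listing approach described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not Spark's or Qubole's code: `list_page` is a hypothetical in-memory stand-in for one paged listing call, and `bulk_list`, `all_keys`, and `page_size` are names invented for the example. Only `os.path.commonprefix` is a real library call.

```python
import os.path
from typing import Dict, List, Optional, Sequence

PAGE_SIZE = 1000  # the issue says results come back in blocks of 1000


def list_page(store: List[str], prefix: str, start_after: Optional[str],
              page_size: int) -> List[str]:
    """Hypothetical stand-in for one listing call against a sorted key store."""
    matching = [k for k in store
                if k.startswith(prefix) and (start_after is None or k > start_after)]
    return matching[:page_size]


def bulk_list(all_keys: Sequence[str], input_paths: Sequence[str],
              page_size: int = PAGE_SIZE) -> Dict[str, List[str]]:
    """List everything under the common prefix, then filter per input path."""
    # Character-wise common prefix of all input paths.
    prefix = os.path.commonprefix(list(input_paths))
    wanted = [p.rstrip("/") + "/" for p in input_paths]
    store = sorted(all_keys)
    result: Dict[str, List[str]] = {p: [] for p in input_paths}
    marker: Optional[str] = None
    while True:
        page = list_page(store, prefix, marker, page_size)
        if not page:
            break
        for key in page:
            # Keep only keys that fall under one of the requested paths.
            for path, directory in zip(input_paths, wanted):
                if key.startswith(directory):
                    result[path].append(key)
                    break
        marker = page[-1]  # resume after the last key seen
    return result
```

Note that the listing may return keys outside the requested paths (any sibling that shares the common prefix), which is why the filtering step is needed.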
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)