GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/11242

    SPARK-9926: Parallelize partition logic in UnionRDD.

    This patch carries over the logic from #8512 that uses a parallel collection to 
compute partitions in UnionRDD. The rest of #8512 added an alternative code 
path for calculating splits in S3, but that isn't necessary to get the same 
speedup. The underlying problem wasn't that bulk listing wasn't used; it was 
that an extra FileStatus was retrieved for each file. That fix was just 
committed as 
[HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810). (I think 
the original commit also used a single prefix to enumerate all paths, but that 
isn't always helpful and it was removed in later versions, so there is no need 
for SparkS3Utils.)
    
    I tested this using the same table that @piapiaozhexiu was using. 
Calculating splits for a 10-day period took 25 seconds with this change and 
HADOOP-12810, which is on par with the results from #8512.
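
    For reference, the idea is just to evaluate each child RDD's partition 
listing on a Scala parallel collection so the slow listings overlap, rather 
than walking the children one at a time. Below is a minimal standalone sketch 
of that pattern, not the patch itself; ChildRdd, listPartitions, and the 
threshold of 10 are made-up names for illustration.

    // Sketch only: lists partitions of many child RDDs concurrently with a Scala
    // parallel collection. On Scala 2.13+ this needs the scala-parallel-collections
    // module and `import scala.collection.parallel.CollectionConverters._`.
    object ParallelPartitionListing {

      // Stand-in for a child RDD whose partition listing is slow (e.g. S3 listing).
      final case class ChildRdd(id: Int) {
        def listPartitions(): Seq[String] = {
          Thread.sleep(100) // simulate a slow remote metadata call
          (0 until 4).map(p => s"rdd-$id-part-$p")
        }
      }

      // List every child's partitions, in parallel once there are enough children
      // to be worth the overhead of a parallel collection.
      def allPartitions(rdds: Seq[ChildRdd], threshold: Int = 10): Seq[String] =
        if (rdds.length >= threshold) rdds.par.flatMap(_.listPartitions()).seq
        else rdds.flatMap(_.listPartitions())

      def main(args: Array[String]): Unit = {
        val children = (1 to 50).map(ChildRdd(_))
        val start = System.nanoTime()
        val parts = allPartitions(children)
        val elapsedMs = (System.nanoTime() - start) / 1000000
        println(s"Listed ${parts.size} partitions in $elapsedMs ms")
      }
    }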

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-9926-parallelize-union-rdd

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11242.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11242
    
----
commit 535128ccb96ec22b95a286ae3f736abfa9ab8002
Author: Cheolsoo Park <[email protected]>
Date:   2015-08-12T20:30:24Z

    SPARK-9926: Parallelize partition logic in UnionRDD.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
