Github user piaozhexiu commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-138731154
  
    @davies Your commit works great! It is basically on par with my last commit. Since your commit 1) minimizes code changes and 2) handles both Hive tables and HadoopFsRelation, it is the superior fix. I just pushed it into the PR.
    
    Here is the result. The "e" (purple) is the parallelized UnionRDD:
    ![image 
1](https://cloud.githubusercontent.com/assets/179618/9749875/f6e623b8-5645-11e5-8bc4-0ba9fb9144dd.png)
    
    a: vanilla spark
    b: fileinputformat w/ 10 threads
    c: s3 bulk listing w/ common prefix
    d: s3 bulk listing w/ 10 threads
    e: s3 bulk listing in union rdd w/ 10 threads
    
    I verified that the total size/number of input splits are correct.
    
    My only question for you at this point is: what name do you suggest for the property that controls the parallelism in `UnionRDD`? Or are we just going to hardcode it as 10 for now?
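    
    For reference, here is a minimal sketch of the shape I have in mind (not the code in this PR): the property is read with a hardcoded default of 10 and used to size the thread pool that builds the children of the `UnionRDD`. The property name `spark.sql.sources.listingParallelism` and the `buildRdd` helper are placeholders, not names from this patch.
    
    ```scala
    import java.util.concurrent.Executors
    
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import scala.reflect.ClassTag
    
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{RDD, UnionRDD}
    
    object ParallelListingSketch {
      // Hypothetical property name; falls back to the hardcoded 10 discussed above.
      def parallelUnion[T: ClassTag](
          sc: SparkContext,
          dirs: Seq[String],
          buildRdd: String => RDD[T]): RDD[T] = {
        val parallelism =
          sc.getConf.getInt("spark.sql.sources.listingParallelism", 10)
        val pool = Executors.newFixedThreadPool(parallelism)
        implicit val ec = ExecutionContext.fromExecutorService(pool)
        try {
          // Build each child RDD (which triggers the per-directory S3 listing)
          // on the pool, then union the results.
          val rdds = Await.result(
            Future.sequence(dirs.map(dir => Future(buildRdd(dir)))), Duration.Inf)
          new UnionRDD(sc, rdds)
        } finally {
          pool.shutdown()
        }
      }
    }
    ```
    
    Whether the pool is created per call or shared is an open detail; the sketch only shows where the property would plug in.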
    
    @yhuai thank you for asking about testing. I admit I don't have good ideas about how to test my `SparkS3Util` without real S3 data. I looked at [Presto unit tests](https://github.com/facebook/presto/blob/master/presto-hive/src/test/java/com/facebook/presto/hive/TestPrestoS3FileSystem.java) for their S3FileSystem, but it also has very limited test coverage. I can at least add a similar set of unit tests, though.
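    
    Along those lines, here is a rough sketch of what such a unit test could look like with a mocked S3 client, so the listing path can be exercised without real S3 data. Only the AWS SDK and Mockito calls are real; the commented `SparkS3Util.listObjects` call is a placeholder for whatever the actual entry point ends up being.
    
    ```scala
    import scala.collection.JavaConverters._
    
    import com.amazonaws.services.s3.AmazonS3
    import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
    import org.mockito.Matchers.any
    import org.mockito.Mockito.{mock, when}
    import org.scalatest.FunSuite
    
    class SparkS3UtilSuite extends FunSuite {
    
      private def summary(key: String, size: Long): S3ObjectSummary = {
        val s = new S3ObjectSummary()
        s.setBucketName("test-bucket")
        s.setKey(key)
        s.setSize(size)
        s
      }
    
      test("bulk listing under a common prefix finds every partition file") {
        // Stub the client so listObjects returns one fixed, non-truncated page.
        val s3 = mock(classOf[AmazonS3])
        val listing = new ObjectListing()
        listing.getObjectSummaries.addAll(Seq(
          summary("table/dt=2015-09-01/part-00000", 10L),
          summary("table/dt=2015-09-02/part-00000", 20L)).asJava)
        listing.setTruncated(false)
        when(s3.listObjects(any(classOf[ListObjectsRequest]))).thenReturn(listing)
    
        // Placeholder for the real SparkS3Util API:
        // val statuses = SparkS3Util.listObjects(s3, "s3n://test-bucket/table")
        // assert(statuses.map(_.getLen).sum === 30L)
      }
    }
    ```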
    
    Btw, I have already deployed this patch in my environment as an experimental version for users, so it is being actively tested at Netflix.

