Github user piaozhexiu commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-138731154
  
    @davies Your commit works great! It is basically on par with my last commit. Since your commit 1) minimizes code changes and 2) handles both Hive tables and HadoopFsRelation, it is the superior fix. I just pushed it into the PR.
    
    Here is the result. The "e" (purple) is the parallelized UnionRDD:
    ![image 
1](https://cloud.githubusercontent.com/assets/179618/9749875/f6e623b8-5645-11e5-8bc4-0ba9fb9144dd.png)
    
    a: vanilla spark
    b: fileinputformat w/ 10 threads
    c: s3 bulk listing w/ common prefix
    d: s3 bulk listing w/ 10 threads
    e: s3 bulk listing in union rdd w/ 10 threads
    
    I verified that the total size/number of input splits are correct.
    
    My only question for you at this point is: what name do you suggest for the property that controls the parallelism in `UnionRDD`? Or are we just going to hardcode it as 10 for now?
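    
    For reference, here is a minimal sketch of the shape I have in mind (not the code in this PR): the property is read with a hardcoded default of 10 and used to size the thread pool that builds the children of the `UnionRDD`. The property name `spark.sql.sources.listingParallelism` and the `buildRdd` helper are placeholders, not names from this patch.
    
    ```scala
    import java.util.concurrent.Executors
    
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import scala.reflect.ClassTag
    
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{RDD, UnionRDD}
    
    object ParallelListingSketch {
      // Hypothetical property name; falls back to the hardcoded 10 discussed above.
      def parallelUnion[T: ClassTag](
          sc: SparkContext,
          dirs: Seq[String],
          buildRdd: String => RDD[T]): RDD[T] = {
        val parallelism =
          sc.getConf.getInt("spark.sql.sources.listingParallelism", 10)
        val pool = Executors.newFixedThreadPool(parallelism)
        implicit val ec = ExecutionContext.fromExecutorService(pool)
        try {
          // Build each child RDD (which triggers the per-directory S3 listing)
          // on the pool, then union the results.
          val rdds = Await.result(
            Future.sequence(dirs.map(dir => Future(buildRdd(dir)))), Duration.Inf)
          new UnionRDD(sc, rdds)
        } finally {
          pool.shutdown()
        }
      }
    }
    ```
    
    Whether the pool is created per call or shared is an open detail; the sketch only shows where the property would plug in.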
    
    @yhuai thank you for asking about testing. I admit I don't have good ideas about how to test my `SparkS3Util` without real S3 data. I looked at [Presto unit tests](https://github.com/facebook/presto/blob/master/presto-hive/src/test/java/com/facebook/presto/hive/TestPrestoS3FileSystem.java) for their S3FileSystem, but it also has very limited test coverage. I can at least add a similar set of unit tests, though.
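    
    Along those lines, here is a rough sketch of what such a unit test could look like with a mocked S3 client, so the listing path can be exercised without real S3 data. Only the AWS SDK and Mockito calls are real; the commented `SparkS3Util.listObjects` call is a placeholder for whatever the actual entry point ends up being.
    
    ```scala
    import scala.collection.JavaConverters._
    
    import com.amazonaws.services.s3.AmazonS3
    import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
    import org.mockito.Matchers.any
    import org.mockito.Mockito.{mock, when}
    import org.scalatest.FunSuite
    
    class SparkS3UtilSuite extends FunSuite {
    
      private def summary(key: String, size: Long): S3ObjectSummary = {
        val s = new S3ObjectSummary()
        s.setBucketName("test-bucket")
        s.setKey(key)
        s.setSize(size)
        s
      }
    
      test("bulk listing under a common prefix finds every partition file") {
        // Stub the client so listObjects returns one fixed, non-truncated page.
        val s3 = mock(classOf[AmazonS3])
        val listing = new ObjectListing()
        listing.getObjectSummaries.addAll(Seq(
          summary("table/dt=2015-09-01/part-00000", 10L),
          summary("table/dt=2015-09-02/part-00000", 20L)).asJava)
        listing.setTruncated(false)
        when(s3.listObjects(any(classOf[ListObjectsRequest]))).thenReturn(listing)
    
        // Placeholder for the real SparkS3Util API:
        // val statuses = SparkS3Util.listObjects(s3, "s3n://test-bucket/table")
        // assert(statuses.map(_.getLen).sum === 30L)
      }
    }
    ```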
    
    Btw, I have already deployed this patch in my environment as an experimental version for users, so it is being actively tested at Netflix.

