[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

xuanyuanking Tue, 02 May 2017 02:02:15 -0700

Github user xuanyuanking commented on the issue:

    https://github.com/apache/spark/pull/17702
  
    Thanks for your review. @gatorsmile @cloud-fan 
    
    `Can you show us the performance difference?`
    
    No problem. I reproduced our online case offline, as shown below.
    
    ## Without parallel resolve:
    
![image](https://cloud.githubusercontent.com/assets/4833765/25610958/d4b23510-2f57-11e7-9a87-969e215b16c6.png)
    
    ## With parallel resolve:
    
![image](https://cloud.githubusercontent.com/assets/4833765/25611104/6d061908-2f58-11e7-9006-08368d9a6610.png)
    
    ## Test env:
    item | detail
    --------------|---------------
    Spark version | current master 2.2.0-SNAPSHOT
    Hadoop version | 2.7.2
    HDFS | 8 servers (128G, 20 cores + 20 hyper-threads)
    Test case | `spark.read.text("/app/dc/test_for_ls/*/*/*/*").count()` <br/>The first level below `test_for_ls` contains 96 directories, and each of them contains 1000 directories at the next level. The third and fourth levels hold a single directory and a single file, respectively.
    
    ## Discussion:
    1. More complex scenarios and deeper directory levels benefit more from this optimization; in this local test it makes the query 100% faster.
    2. The `spark.sql.globPathInParallel` config only parallelizes the glob path resolution step.
    3. If the driver and the cluster are in different geographical regions, this improvement gives at least a 5x speedup in our scenario, because the resolution work becomes a parallel job on the cluster.
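
To illustrate the idea behind the discussion points above, here is a minimal sketch in plain Python. It is not the code from this PR: it uses a local temporary directory instead of HDFS and a thread pool instead of a Spark job, and the helper names (`build_tree`, `resolve_serial`, `resolve_parallel`) and the scaled-down tree sizes are assumptions made only for this example. It shows the general technique of expanding the first glob level once and then resolving each subtree concurrently.

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_tree(root, n_first=4, n_second=5):
    """Create a scaled-down root/<a>/<b>/d/f.txt layout (hypothetical sizes)."""
    for a in range(n_first):
        for b in range(n_second):
            leaf_dir = os.path.join(root, f"a{a}", f"b{b}", "d")
            os.makedirs(leaf_dir)
            with open(os.path.join(leaf_dir, "f.txt"), "w") as fh:
                fh.write("x\n")


def resolve_serial(root, rest="*/*/*"):
    """Expand the whole four-level pattern in one sequential call."""
    return sorted(glob.glob(os.path.join(root, "*", rest)))


def resolve_parallel(root, rest="*/*/*", workers=8):
    """Expand the first level once, then expand each subtree concurrently."""
    first_level = sorted(glob.glob(os.path.join(root, "*")))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda d: glob.glob(os.path.join(d, rest)), first_level)
        return sorted(path for part in parts for path in part)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_tree(root)
        # Both strategies must return the same set of leaf files.
        print(resolve_serial(root) == resolve_parallel(root))
```

On a local disk the two variants will likely take similar time; the gain described above comes from listing calls that are slow individually (for example, a remote NameNode), where issuing them concurrently hides the per-call latency.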



