Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/17702

Thanks for your review, @gatorsmile @cloud-fan.

`Can you show us the performance difference?`

No problem; I reproduced our online case offline, as shown below.

## Without parallel resolve:
![image](https://cloud.githubusercontent.com/assets/4833765/25610958/d4b23510-2f57-11e7-9a87-969e215b16c6.png)

## With parallel resolve:
![image](https://cloud.githubusercontent.com/assets/4833765/25611104/6d061908-2f58-11e7-9006-08368d9a6610.png)

## Test env:

item | detail
--------------|---------------
Spark version | current master 2.2.0-SNAPSHOT
Hadoop version | 2.7.2
HDFS | 8 servers (128 GB RAM, 20 cores + 20 hyper-threads each)
Test case | ```spark.read.text("/app/dc/test_for_ls/*/*/*/*").count()``` <br/>The first level below `test_for_ls` contains 96 directories, each of which contains 1000 directories at the next level; the third and fourth levels contain only one directory and one file each.

## Discussion:
1. More complex scenarios with deeper directory hierarchies benefit even more from this optimization; in this local test it already makes path resolution about 100% faster.
2. The `spark.sql.globPathInParallel` config only parallelizes the glob path resolution step.
3. If the driver and the cluster are in different geographical regions, this improvement yields at least a 5x speedup in our scenario, because the resolution work becomes a parallel job running on the cluster.
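To make the idea concrete, here is a minimal, self-contained sketch of the technique discussed above: instead of expanding a multi-level glob sequentially from a single driver thread, expand the first wildcard level, then resolve the remaining pattern under each subtree concurrently. This is only an illustration against the local filesystem with a thread pool; the function name `resolve_glob_parallel` and the `max_workers` parameter are hypothetical and not part of the actual PR, which distributes the listing as a Spark job on the cluster.

```python
import glob
from concurrent.futures import ThreadPoolExecutor

def resolve_glob_parallel(pattern, max_workers=8):
    """Expand a glob pattern, parallelizing over the first wildcard level.

    Illustrative sketch only; the real change in the PR runs the
    resolution as a distributed Spark job rather than in local threads.
    """
    head, star, tail = pattern.partition("*")
    # Expand the first wildcard level sequentially...
    first_level = glob.glob(head + "*")
    if not tail:
        return sorted(first_level)
    # ...then resolve the rest of the pattern under each subtree in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_subtree = pool.map(lambda d: glob.glob(d + tail), first_level)
    return sorted(path for paths in per_subtree for path in paths)
```

With 96 first-level directories, each thread handles a subtree independently, so the deeper levels (the 1000-directory fan-out in the test case above) are listed concurrently instead of one after another; that is where the roughly 2x local speedup comes from.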