Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/17702
Thanks for your review. @gatorsmile @cloud-fan
`Can you show us the performance difference?`
No problem. I reproduced our online case offline, as shown below.
## Without parallel resolve:

## With parallel resolve:

## Test env:
item | detail
--------------|---------------
Spark version | current master 2.2.0-SNAPSHOT
Hadoop version | 2.7.2
HDFS | 8 servers (128G, 20 cores + 20 hyper-threads)
Test case | ```spark.read.text("/app/dc/test_for_ls/*/*/*/*").count()```<br/>The first level below `test_for_ls` contains 96 directories, each of which has 1000 directories at the next level; the third and fourth levels contain only 1 directory and 1 file, respectively.
## Discussion:
1. More complex scenarios and deeper directory levels benefit more from this
optimization; in this local test it makes the job 100% faster.
2. The `spark.sql.globPathInParallel` config only parallelizes the glob path
resolution step.
3. If the driver and the cluster are in different geographical regions, this
improvement yields at least a 5x speedup in our scenario, because the resolving
work becomes a parallel job on the cluster.
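
To make point 2 concrete, here is a minimal Python sketch of resolving a multi-level glob one level at a time, with the directory listings at each level running concurrently. It is not the code in this PR: it assumes a local filesystem and a thread pool instead of HDFS and a Spark job, and `list_matching`, `resolve_glob_parallel`, and `build_tree` are hypothetical names used only for illustration. `build_tree` creates a scaled-down version of the test layout described above.

```python
import fnmatch
import os
from concurrent.futures import ThreadPoolExecutor


def list_matching(parent, pattern):
    """Return the children of `parent` whose names match one glob component."""
    if not os.path.isdir(parent):
        return []
    return [
        os.path.join(parent, name)
        for name in sorted(os.listdir(parent))
        if fnmatch.fnmatch(name, pattern)
    ]


def resolve_glob_parallel(root, components, workers=8):
    """Expand a glob level by level; listings within a level run concurrently."""
    frontier = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pattern in components:
            # One listing task per directory found at the previous level.
            listings = pool.map(list_matching, frontier, [pattern] * len(frontier))
            frontier = [child for children in listings for child in children]
    return frontier


def build_tree(root, top=3, second=4):
    """Create top x second directories, each holding one directory and one file."""
    for i in range(top):
        for j in range(second):
            leaf = os.path.join(root, f"a{i}", f"b{j}", "c0")
            os.makedirs(leaf)
            with open(os.path.join(leaf, "part-0.txt"), "w") as f:
                f.write("x\n")


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        build_tree(tmp)
        paths = resolve_glob_parallel(tmp, ["*", "*", "*", "*"])
        print(len(paths), "files matched")
```

The sequential cost of a glob like `*/*/*/*` is dominated by one listing call per intermediate directory, so fanning those calls out is where the gain comes from; on a remote filesystem each call is a round trip, which is why the effect grows with directory count and latency.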