Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17702

This approach only works if the first-level glob pattern matches a lot of directories, e.g. `/my_path/*/*`. Otherwise we can't apply it, e.g. `/my_path/{ab, cd}/*`.

My proposal: think about how glob works.

1. Split the path into parts, e.g. `/a/*/*` -> `a, *, *`.
2. For each path part, expand it if it's a glob pattern, then flatMap the expanded results and expand the next path part; repeat until the last path part.

Step by step, we first expand `/a/*/*` to `/a/b1/*; /a/b2/*`, and then to `/a/b1/c1; /a/b1/c2; /a/b2/c1; /a/b2/c2`. Theoretically, we can add a check at each step: if the current to-be-expanded list is above a threshold, do the next expansion in parallel.

Maybe we should just fork the Hadoop `Globber` and improve it to run in parallel.
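The level-by-level expansion described above can be sketched roughly as follows. This is only an illustration, not the Hadoop `Globber` code: an in-memory map stands in for `FileSystem.listStatus`, the glob matching handles only `*` for brevity, and the names `TREE`, `PARALLEL_THRESHOLD`, and `expand` are made up for this sketch.

```java
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.*;

public class GlobExpand {
    // Hypothetical stand-in for FileSystem.listStatus: maps a directory
    // path to the names of its children.
    static final Map<String, List<String>> TREE = Map.of(
        "/a",    List.of("b1", "b2"),
        "/a/b1", List.of("c1", "c2"),
        "/a/b2", List.of("c1", "c2")
    );

    // Illustrative threshold above which a level is expanded in parallel.
    static final int PARALLEL_THRESHOLD = 2;

    static boolean isGlob(String part) {
        return part.contains("*") || part.contains("{") || part.contains("?");
    }

    // Expand one path component against the children of `dir`.
    // A non-glob component is simply appended; a glob component is
    // matched against the directory listing.
    static List<String> expandPart(String dir, String part) {
        if (!isGlob(part)) return List.of(dir + "/" + part);
        Pattern p = Pattern.compile(part.replace("*", ".*")); // '*' only, for brevity
        return TREE.getOrDefault(dir, List.of()).stream()
                .filter(child -> p.matcher(child).matches())
                .map(child -> dir + "/" + child)
                .collect(Collectors.toList());
    }

    // Level-by-level expansion: expand part i against every candidate so
    // far, flatMap the results, then move on to part i+1. When the
    // candidate list grows past the threshold, expand the next level in
    // parallel.
    static List<String> expand(String path) {
        String[] parts = path.substring(1).split("/"); // drop leading '/'
        List<String> current = List.of("");
        for (String part : parts) {
            Stream<String> s = current.size() > PARALLEL_THRESHOLD
                    ? current.parallelStream()
                    : current.stream();
            current = s.flatMap(dir -> expandPart(dir, part).stream())
                    .sorted()
                    .collect(Collectors.toList());
        }
        return current;
    }

    public static void main(String[] args) {
        // /a/*/* -> /a/b1/*, /a/b2/* -> four leaf paths
        System.out.println(expand("/a/*/*"));
    }
}
```

Running it on `/a/*/*` walks exactly the two expansion steps from the comment: first to `/a/b1`, `/a/b2`, then to the four `/a/bX/cY` leaves.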