Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17702

This approach only works if the first-level glob pattern matches a lot of directories, e.g. `/my_path/*/*`. Otherwise we can't apply it, e.g. `/my_path/{ab, cd}/*`.

My proposal: think about how glob works.

1. Split the path into parts, e.g. `/a/*/*` -> `a, *, *`.
2. For each path part, expand it if it's a glob pattern, then flatMap the expanded results and expand the next path part; repeat until the last path part.

Step by step, we first expand `/a/*/*` to `/a/b1/*; /a/b2/*`, and then to `/a/b1/c1; /a/b1/c2; /a/b2/c1; /a/b2/c2`. Theoretically, we can add a check at each step: if the current to-be-expanded list is above a threshold, do the next expansion in parallel.

Maybe we should just fork the Hadoop `Globber` and improve it to run in parallel.
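The level-by-level expansion described above can be sketched roughly as follows. This is only an illustration, not the Hadoop `Globber` code: an in-memory map stands in for `FileSystem.listStatus`, the glob matching handles only `*` for brevity, and the names `TREE`, `PARALLEL_THRESHOLD`, and `expand` are made up for this sketch.

```java
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.*;

public class GlobExpand {
    // Hypothetical stand-in for FileSystem.listStatus: maps a directory
    // path to the names of its children.
    static final Map<String, List<String>> TREE = Map.of(
        "/a",    List.of("b1", "b2"),
        "/a/b1", List.of("c1", "c2"),
        "/a/b2", List.of("c1", "c2")
    );

    // Illustrative threshold above which a level is expanded in parallel.
    static final int PARALLEL_THRESHOLD = 2;

    static boolean isGlob(String part) {
        return part.contains("*") || part.contains("{") || part.contains("?");
    }

    // Expand one path component against the children of `dir`.
    // A non-glob component is simply appended; a glob component is
    // matched against the directory listing.
    static List<String> expandPart(String dir, String part) {
        if (!isGlob(part)) return List.of(dir + "/" + part);
        Pattern p = Pattern.compile(part.replace("*", ".*")); // '*' only, for brevity
        return TREE.getOrDefault(dir, List.of()).stream()
                .filter(child -> p.matcher(child).matches())
                .map(child -> dir + "/" + child)
                .collect(Collectors.toList());
    }

    // Level-by-level expansion: expand part i against every candidate so
    // far, flatMap the results, then move on to part i+1. When the
    // candidate list grows past the threshold, expand the next level in
    // parallel.
    static List<String> expand(String path) {
        String[] parts = path.substring(1).split("/"); // drop leading '/'
        List<String> current = List.of("");
        for (String part : parts) {
            Stream<String> s = current.size() > PARALLEL_THRESHOLD
                    ? current.parallelStream()
                    : current.stream();
            current = s.flatMap(dir -> expandPart(dir, part).stream())
                    .sorted()
                    .collect(Collectors.toList());
        }
        return current;
    }

    public static void main(String[] args) {
        // /a/*/* -> /a/b1/*, /a/b2/* -> four leaf paths
        System.out.println(expand("/a/*/*"));
    }
}
```

Running it on `/a/*/*` walks exactly the two expansion steps from the comment: first to `/a/b1`, `/a/b2`, then to the four `/a/bX/cY` leaves.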