Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181014951
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                 matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff ---
    
    @cloud-fan @jiangxb1987
    Thanks a lot for review.
    > Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here but seems that require more complicated approach.
    
    Yes, true. In this change I only optimize the case where the user specifies a value for each partition column, which is very common in production -- our users typically write: `select xxx from yyy where year=yy and month=mm and day=dd`
    
    I'm not sure which direction you both prefer: leave the current logic as it is (at least the code is very simple)? Implement a more complicated approach that covers as many cases as possible? Or build on this PR and cover the most common cases?
    
    Thanks again for review :)

