Todd Lipcon created IMPALA-7320:
-----------------------------------

             Summary: Loading HDFS tables calls getFileStatus on each partition serially
                 Key: IMPALA-7320
                 URL: https://issues.apache.org/jira/browse/IMPALA-7320
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
    Affects Versions: Impala 3.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


The catalog caches the access level (permissions) of each partition in an HDFS 
table. These are all loaded when the table is first loaded, by making serial 
calls to getFileStatus() on each partition directory. In most cases, all of 
the partitions are in a single directory, so we could get all of the 
information through a single call to listStatus() on the parent. In my 
testing, a typical getFileStatus() call took 1-2 milliseconds, so on a large 
table with tens of thousands of partitions (for example, 20,000 partitions is 
20-40 seconds of serial calls) batching could shave many seconds off of the 
table load time as well as reduce load on the NameNode (NN).
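
A rough sketch of the batching idea, not the actual catalog code: group the 
partition directories by parent, issue one directory listing per parent (the 
call is listStatus() in Hadoop's FileSystem API), and fall back to a 
per-partition getFileStatus() only when a partition is missing from the 
listing or is the sole partition under its parent. StatusSource, 
PartitionPermissionLoader and CountingFakeFs are hypothetical names invented 
for this illustration; StatusSource stands in for the small part of 
FileSystem that matters here, and permissions are plain strings to keep the 
example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the subset of Hadoop's FileSystem used here.
// Each method call models one round trip to the NameNode.
interface StatusSource {
    // Returns the permission string of a single path (null if absent).
    String getFileStatus(String path);

    // Returns child path -> permission string for every child of parentDir.
    Map<String, String> listStatus(String parentDir);
}

final class PartitionPermissionLoader {
    private PartitionPermissionLoader() {}

    // Parent directory of a '/'-separated absolute path.
    static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        return idx <= 0 ? "/" : path.substring(0, idx);
    }

    // Loads permissions for all partition directories, using one listing
    // per distinct parent directory instead of one call per partition.
    static Map<String, String> load(StatusSource fs, List<String> partitionDirs) {
        Map<String, List<String>> byParent = new LinkedHashMap<>();
        for (String dir : partitionDirs) {
            byParent.computeIfAbsent(parentOf(dir), k -> new ArrayList<>()).add(dir);
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : byParent.entrySet()) {
            List<String> children = entry.getValue();
            if (children.size() == 1) {
                // A listing saves nothing for a lone partition.
                String dir = children.get(0);
                result.put(dir, fs.getFileStatus(dir));
                continue;
            }
            Map<String, String> listing = fs.listStatus(entry.getKey());
            for (String dir : children) {
                String perm = listing.get(dir);
                // Fall back to a direct call if the listing lacks the entry.
                result.put(dir, perm != null ? perm : fs.getFileStatus(dir));
            }
        }
        return result;
    }
}

// In-memory fake that counts simulated round trips.
final class CountingFakeFs implements StatusSource {
    final Map<String, String> perms = new HashMap<>();
    int calls = 0;

    @Override
    public String getFileStatus(String path) {
        calls++;
        return perms.get(path);
    }

    @Override
    public Map<String, String> listStatus(String parentDir) {
        calls++;
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : perms.entrySet()) {
            if (PartitionPermissionLoader.parentOf(e.getKey()).equals(parentDir)) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

With the fake above, a table whose partitions all share one parent directory 
costs a single simulated round trip regardless of partition count, while a 
partition stored in a different location adds one extra call. Grouping by 
parent rather than assuming the table root also covers multi-level partition 
layouts, at the cost of one listing per leaf parent.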



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
