Todd Lipcon created IMPALA-7320:
-----------------------------------
Summary: Loading HDFS tables calls getFileStatus on each partition serially
Key: IMPALA-7320
URL: https://issues.apache.org/jira/browse/IMPALA-7320
Project: IMPALA
Issue Type: Improvement
Components: Catalog
Affects Versions: Impala 3.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
The catalog caches the access level (permissions) of each partition in an HDFS
table. These are all loaded when the table is first loaded, by making serial
calls to getFileStatus() on each partition directory. In most cases, all of the
partitions live under a single parent directory, so we could get the same
information through a single call to listStatus() on the parent. In my testing,
a typical getFileStatus() call took 1-2 milliseconds, so on a large table with
tens of thousands of partitions (for example, 20,000 partitions at 1-2 ms each
is 20-40 seconds) this can shave many seconds off the table load time as well
as reduce load on the NameNode (NN).
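
A rough sketch of the idea, to make the call-count difference concrete. This
is not the catalog code: PartitionAccessSketch, Status, Fs and CountingFs are
hypothetical stand-ins, and the in-memory CountingFs only imitates a per-path
lookup (getFileStatus) and a listing call on the parent (named listStatus
here, after Hadoop's FileSystem.listStatus). It assumes permissions can be
read from the parent listing and that partitions are grouped by parent
directory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types are Impala or Hadoop classes.
class PartitionAccessSketch {

  /** Minimal stand-in for a file status: a path plus a permission string. */
  static final class Status {
    final String path;
    final String permission;
    Status(String path, String permission) {
      this.path = path;
      this.permission = permission;
    }
  }

  /** Stand-in for the two filesystem calls discussed above. */
  interface Fs {
    Status getFileStatus(String path);
    List<Status> listStatus(String parentDir);
  }

  /** In-memory filesystem that counts calls (each one models an RPC). */
  static final class CountingFs implements Fs {
    private final Map<String, String> permsByPath = new HashMap<>();
    int calls = 0;

    void add(String path, String permission) {
      permsByPath.put(path, permission);
    }

    public Status getFileStatus(String path) {
      calls++;
      String perm = permsByPath.get(path);
      return perm == null ? null : new Status(path, perm);
    }

    public List<Status> listStatus(String parentDir) {
      calls++;
      String prefix = parentDir.endsWith("/") ? parentDir : parentDir + "/";
      List<Status> out = new ArrayList<>();
      for (Map.Entry<String, String> e : permsByPath.entrySet()) {
        String p = e.getKey();
        // Keep direct children of parentDir only.
        if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
          out.add(new Status(p, e.getValue()));
        }
      }
      return out;
    }
  }

  static String parentOf(String path) {
    int i = path.lastIndexOf('/');
    return i <= 0 ? "/" : path.substring(0, i);
  }

  /** Behavior described in this issue: one call per partition. */
  static Map<String, String> loadSerially(Fs fs, List<String> partitions) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String p : partitions) {
      Status s = fs.getFileStatus(p);
      result.put(p, s == null ? null : s.permission);
    }
    return result;
  }

  /** Proposed behavior: one listing call per distinct parent directory. */
  static Map<String, String> loadByParentListing(Fs fs, List<String> partitions) {
    Map<String, String> listed = new HashMap<>();
    Set<String> parentsListed = new HashSet<>();
    Map<String, String> result = new LinkedHashMap<>();
    for (String p : partitions) {
      String parent = parentOf(p);
      if (parentsListed.add(parent)) {  // true only the first time we see this parent
        for (Status s : fs.listStatus(parent)) {
          listed.put(s.path, s.permission);
        }
      }
      result.put(p, listed.get(p));  // null if the partition directory is missing
    }
    return result;
  }

  public static void main(String[] args) {
    CountingFs fs = new CountingFs();
    List<String> partitions = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      String path = "/warehouse/t/day=" + i;
      fs.add(path, "rwxr-xr-x");
      partitions.add(path);
    }
    loadSerially(fs, partitions);
    System.out.println("serial calls: " + fs.calls);
    fs.calls = 0;
    loadByParentListing(fs, partitions);
    System.out.println("listing calls: " + fs.calls);
  }
}
```

Grouping by parent keeps the sketch correct when some partitions live outside
the table directory, but a table whose partitions each have a different parent
would see no benefit from this approach, and a parent directory with many
unrelated children could make the listing more expensive than the individual
calls it replaces.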
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)