Load metadata exactly in need

Quanlong Huang Mon, 11 Sep 2017 20:52:23 -0700

Hi all,


Currently, if a "describe" statement hits an incomplete table, the impalad 
sends an RPC request to the catalogd to load the metadata of this table. This 
takes a long time for tables with many partitions and many files. However, to 
serve the "describe" statement, we only need the metadata in Hive MetaStore. In 
my experiments (with load_catalog_in_background=false), it takes hours to 
describe a large table. This statement is pretty cheap in Hive or Presto, so 
users may worry about whether Impala is set up correctly.


Can we add a more fine-grained strategy for loading the metadata? For queries 
that hit just one partition of a huge table, we don't need to load all the file 
descriptors either. For example, we could have more levels of metadata loading:
Level1. Load metadata from Hive MetaStore
Level2. Load file descriptors of given partitions
Level3. Load all file descriptors


Then we can serve the following scenarios better:
1. describe a large table
2. run a query on one or several partitions of this table (each partition has 
only a few files)
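
To make the idea concrete, here is a rough sketch in Python. It is only an 
illustration of the proposed levels, not Impala's actual catalog code: the 
names LoadLevel, required_level and needs_load are all hypothetical, and the 
statement kinds are simplified to plain strings.

```python
from enum import IntEnum
from typing import Optional, Sequence


class LoadLevel(IntEnum):
    """Hypothetical load levels; a higher level includes the lower ones."""
    UNLOADED = 0         # incomplete table, nothing loaded yet
    HMS_ONLY = 1         # Level1: metadata from Hive MetaStore only
    PARTITION_FILES = 2  # Level2: file descriptors of the given partitions
    ALL_FILES = 3        # Level3: file descriptors of all partitions


def required_level(stmt_kind: str,
                   partitions: Optional[Sequence[str]] = None) -> LoadLevel:
    """Pick the cheapest level that can serve a statement (assumed mapping)."""
    if stmt_kind == "describe":
        # "describe" only needs what Hive MetaStore already knows.
        return LoadLevel.HMS_ONLY
    if stmt_kind == "query" and partitions:
        # Partition pruning left a known, small set of partitions.
        return LoadLevel.PARTITION_FILES
    # Anything else falls back to the full load that happens today.
    return LoadLevel.ALL_FILES


def needs_load(current: LoadLevel, required: LoadLevel) -> bool:
    """True if the table must be loaded further before serving the statement."""
    return current < required
```

With this, required_level("describe") asks only for Level1, and 
required_level("query", ["ds=2017-09-11"]) asks for Level2. A real 
implementation would also have to track which partitions are already loaded at 
Level2; the sketch leaves that out.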


Has this been discussed before?


Thanks
Quanlong
