Hi all,

Currently, if a "describe" statement hits an incomplete table, the impalad sends an RPC request to the catalogd to load the metadata of this table. This takes a long time for tables with many partitions and many files. However, to serve the "describe" statement, we only need the metadata in Hive MetaStore.

In my experiments (with load_catalog_in_background=false), it takes hours to describe a large table. The same statement is pretty cheap in Hive or Presto, so users may wonder whether Impala is set up correctly.

Can we add a finer-grained strategy for loading the metadata? Queries that hit only one partition of a huge table don't need all the file descriptors loaded either. For example, we could have more levels of metadata loading:

Level1. Load metadata from Hive MetaStore
Level2. Load file descriptors of given partitions
Level3. Load all file descriptors

Then we can serve the following scenarios better:

1. Describe a large table.
2. Run a query on one or several partitions of this table (each partition has few files).

Has this been discussed before?

Thanks,
Quanlong
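
P.S. To make the three levels concrete, here is a rough Java sketch of how a statement might be mapped to the minimum load level it needs. Every name in it (MetadataLoadLevels, LoadLevel, StatementKind, requiredLevel) is hypothetical and made up for illustration; none of them are existing Impala classes, and the real decision would probably depend on more than the statement kind and the pruned partition list.

```java
import java.util.List;

// Hypothetical sketch only: these types do not exist in Impala.
final class MetadataLoadLevels {

    // The three proposed levels, ordered from cheapest to most expensive.
    enum LoadLevel {
        HMS_ONLY,         // Level1: metadata from Hive MetaStore only
        PARTITION_FILES,  // Level2: file descriptors of the given partitions
        ALL_FILES         // Level3: file descriptors of all partitions
    }

    // Simplified statement kinds, just enough for the two scenarios.
    enum StatementKind { DESCRIBE, SELECT }

    // Returns the cheapest level that can serve the statement.
    // prunedPartitions is assumed to be the list of partitions left after
    // partition pruning; an empty list means the set of partitions is unknown.
    static LoadLevel requiredLevel(StatementKind kind, List<String> prunedPartitions) {
        if (kind == StatementKind.DESCRIBE) {
            return LoadLevel.HMS_ONLY;
        }
        if (!prunedPartitions.isEmpty()) {
            return LoadLevel.PARTITION_FILES;
        }
        return LoadLevel.ALL_FILES;
    }

    public static void main(String[] args) {
        // Scenario 1: describe a large table.
        System.out.println(requiredLevel(StatementKind.DESCRIBE, List.of()));
        // Scenario 2: query that touches a single partition.
        System.out.println(requiredLevel(StatementKind.SELECT, List.of("day=2018-01-01")));
    }
}
```

With something like this, the impalad could ask the catalogd for only the level a statement needs, and upgrade an already loaded table to a higher level later when a heavier query arrives.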
