Hi Quanlong,

You're right. The catalog needs to handle metadata at a finer granularity. We are actively looking into the options you mentioned as well as other related changes (see IMPALA-3234 and IMPALA-3127) to improve the performance and scalability of metadata management.

Thanks
Dimitris

On Mon, Sep 11, 2017 at 8:51 PM, Quanlong Huang <[email protected]> wrote:

> Hi all,
>
> Currently, if a "describe" statement hits an incomplete table, the impalad
> will send an RPC request to the catalogd to load the metadata of this
> table. This takes a long time for tables with many partitions and many
> files. However, to serve the "describe" statement, we just need the
> metadata in Hive MetaStore. In my experiments (with
> load_catalog_in_background=false), it takes hours to describe a large
> table. This statement is pretty cheap in Hive or Presto. Users may worry
> about whether Impala is set up correctly.
>
> Can we add a finer-grained strategy for loading the metadata? For
> queries that hit just one partition of a huge table, we don't need to load
> all the file descriptors either. For example, more levels to trigger
> metadata load:
> Level1. Load metadata from Hive MetaStore
> Level2. Load file descriptors of given partitions
> Level3. Load all file descriptors
>
> Then we can serve the following scenario better:
> 1. describe a large table
> 2. run a query on one or several partitions of this table. (Each partition
> has few files)
>
> Has this been discussed before?
>
> Thanks
> Quanlong
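
For illustration, the three load levels proposed in the quoted message could be sketched roughly as below. This is only a toy model under stated assumptions: `LoadLevel`, `TableMetadata`, `ensure_loaded` and `required_level` are hypothetical names invented here, not Impala's catalog API, and the "loading" is simulated by recording partition names rather than listing files.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class LoadLevel(IntEnum):
    """Hypothetical load levels mirroring Level1..Level3 from the proposal."""
    HMS_ONLY = 1         # schema and partition list from Hive MetaStore
    PARTITION_FILES = 2  # file descriptors for the given partitions only
    ALL_FILES = 3        # file descriptors for every partition


@dataclass
class TableMetadata:
    """Toy stand-in for a catalog table entry (not Impala's actual class)."""
    name: str
    all_partitions: list
    hms_loaded: bool = False
    files_loaded: set = field(default_factory=set)

    def ensure_loaded(self, level, partitions=()):
        """Load only what `level` needs; return the partitions newly loaded."""
        self.hms_loaded = True  # every level needs the HMS metadata
        if level == LoadLevel.HMS_ONLY:
            wanted = set()
        elif level == LoadLevel.PARTITION_FILES:
            wanted = set(partitions)
        else:
            wanted = set(self.all_partitions)
        newly = wanted - self.files_loaded
        self.files_loaded |= newly  # a real loader would list files here
        return sorted(newly)


def required_level(statement, partitions=()):
    """Pick the cheapest level that can serve a statement (illustrative rules)."""
    if statement == "describe":
        return LoadLevel.HMS_ONLY
    if statement == "select" and partitions:
        return LoadLevel.PARTITION_FILES
    return LoadLevel.ALL_FILES


table = TableMetadata("big_table", ["p1", "p2", "p3"])
table.ensure_loaded(required_level("describe"))              # no file listing
table.ensure_loaded(required_level("select", ["p2"]), ["p2"])  # only p2
```

In this sketch a "describe" never touches file descriptors, and a query on one partition loads only that partition, which is the scenario the proposal wants to make cheap. Loading is incremental, so a later full load only lists the partitions that are still missing.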
