Fang-Yu Rao created HIVE-22947:
----------------------------------
Summary: The method getTableObjectsByName() in
HiveMetaStoreClient.java is slow
Key: HIVE-22947
URL: https://issues.apache.org/jira/browse/HIVE-22947
Project: Hive
Issue Type: Improvement
Components: Standalone Metastore
Reporter: Fang-Yu Rao
Attachments: Benchmark_related_to_IMPALA-9363.pdf
The {{getTableObjectsByName()}} RPC in {{HiveMetaStoreClient.java}}
([https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L2111-L2114])
is very slow. In an empirical evaluation on a database containing 40,000
tables, {{getTableObjectsByName()}} took at least 170 seconds to load the
complete metadata of all the tables, whereas {{getAllTables()}}
([https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L2281-L2288]),
which returns only the table names, took less than 0.5 seconds.
In some use cases, not all the fields of
{{org.apache.hadoop.hive.metastore.api.Table}} are required. For instance, if a
client only needs to determine the type of a table, e.g., an HDFS table or
a Kudu table, then it should suffice to load only the {{sd}} field, which is
of class {{org.apache.hadoop.hive.metastore.api.StorageDescriptor}}. It would
be great if {{getTableObjectsByName()}} could be made more fine-grained so that
only the fields requested by the client are retrieved, which could
also reduce the time spent on this RPC.
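To illustrate the kind of fine-grained retrieval meant here, the following is a minimal sketch under stated assumptions, not a proposed patch. {{SimpleTable}}, {{TableField}}, and the three-argument {{getTableObjectsByName()}} are hypothetical stand-ins that do not exist in Hive; an in-memory map plays the role of the metastore, and a single string stands in for the {{sd}} field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; none of these types belong to the Hive metastore API. */
final class TableProjectionSketch {

  /** Hypothetical selector for the parts of a table a client may request. */
  enum TableField { SD, PARAMETERS, PARTITION_KEYS }

  /** Simplified stand-in for org.apache.hadoop.hive.metastore.api.Table. */
  static final class SimpleTable {
    final String name;
    final String sdInputFormat;            // stands in for the sd field
    final Map<String, String> parameters;
    final List<String> partitionKeys;

    SimpleTable(String name, String sdInputFormat,
                Map<String, String> parameters, List<String> partitionKeys) {
      this.name = name;
      this.sdInputFormat = sdInputFormat;
      this.parameters = parameters;
      this.partitionKeys = partitionKeys;
    }
  }

  /**
   * Returns copies of the named tables with only the requested fields set.
   * Fields that were not requested are left null, which is where a real
   * implementation could skip the corresponding backend lookups.
   */
  static List<SimpleTable> getTableObjectsByName(
      Map<String, SimpleTable> store, List<String> names, Set<TableField> fields) {
    List<SimpleTable> result = new ArrayList<>();
    for (String name : names) {
      SimpleTable full = store.get(name);
      if (full == null) {
        continue; // unknown table names are skipped in this sketch
      }
      result.add(new SimpleTable(
          full.name,
          fields.contains(TableField.SD) ? full.sdInputFormat : null,
          fields.contains(TableField.PARAMETERS) ? full.parameters : null,
          fields.contains(TableField.PARTITION_KEYS) ? full.partitionKeys : null));
    }
    return result;
  }

  /** Builds a tiny in-memory store with made-up contents for experimentation. */
  static Map<String, SimpleTable> sampleStore() {
    Map<String, String> params = new HashMap<>();
    params.put("owner", "alice");
    Map<String, SimpleTable> store = new HashMap<>();
    store.put("t1", new SimpleTable(
        "t1", "org.apache.hadoop.mapred.TextInputFormat", params, Arrays.asList("year")));
    store.put("t2", new SimpleTable(
        "t2", "example.OtherInputFormat", params, new ArrayList<String>())); // placeholder format
    return store;
  }
}
```

A client that only needs the table type would call {{getTableObjectsByName(store, names, EnumSet.of(TableField.SD))}} and read {{sdInputFormat}}; the remaining fields come back as null. Whether such a projection would actually shorten the real RPC depends on where the time is spent, which this sketch does not measure.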
The detailed experimental results are provided in the attached spreadsheet
([^Benchmark_related_to_IMPALA-9363.pdf]). In the experiment, Impala's
{{catalogd}}, acting as a client of the Hive metastore, calls
{{getTableObjectsByName()}} to retrieve the complete metadata of the tables
in a database containing 40,000 tables.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)