[ 
https://issues.apache.org/jira/browse/IMPALA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531626#comment-16531626
 ] 

ASF subversion and git services commented on IMPALA-7224:
---------------------------------------------------------

Commit 2b6d71fee779088af54cc416ee25027bbd415954 in impala's branch 
refs/heads/master from [~tlipcon]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=2b6d71f ]

IMPALA-7224. Improve performance of UpdateCatalogMetrics

This function is called after every DDL query, and was implemented by
fetching the entire list of table names, even though only the length
of that list was needed. In workloads with millions of tables, this
could add several seconds of overhead following even simple requests
like 'USE' or 'DESCRIBE'.

I tested a backported version of this patch against one such workload.
It reduced the time taken for a simple DESCRIBE query from 12-14 seconds
down to about 40 ms. I also tested locally that the metrics on impalad
were still updated by DDL operations.

Change-Id: Ic5467adbce1e760ff93996925db5611748efafc0
Reviewed-on: http://gerrit.cloudera.org:8080/10846
Reviewed-by: Vuk Ercegovac <vercego...@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> UpdateCatalogMetrics very slow when there are many tables
> ---------------------------------------------------------
>
>                 Key: IMPALA-7224
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7224
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>
> impalad calls UpdateCatalogMetrics after each statement that it treats as
> DDL. This includes statements such as USE, SHOW TABLES, and DESCRIBE, which
> don't actually change the number of tables in the catalog, and therefore
> probably don't need to update metrics. That aside, even when the metrics _do_
> need to be updated, the implementation is very slow. It calls getTableNames
> on each database, which results in (a) creating an array of all the names,
> (b) sorting that array, and (c) encoding that whole array into Thrift and
> decoding it again. This is very expensive: on a use case with approximately
> 8M tables, each such call takes 10-12 seconds of CPU, most of which is spent
> in sorting and encoding. All that's really needed is a _count_ of tables,
> which could be fetched directly.
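
The difference between the two approaches described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class TableCountSketch, the nested Db class, and the methods numTables, countViaNames, and countDirect are hypothetical names chosen for this example, and the Thrift encode/decode step is left out entirely. Only getTableNames is a name taken from the issue description, and its body here is an assumption about its general shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TableCountSketch {
  /** Hypothetical stand-in for a catalog database; not Impala's actual class. */
  static class Db {
    private final Map<String, Object> tables = new ConcurrentHashMap<>();

    void addTable(String name) { tables.put(name, new Object()); }

    /** Slow path: copies and sorts every name, even if the caller only wants the size. */
    List<String> getTableNames() {
      List<String> names = new ArrayList<>(tables.keySet());
      Collections.sort(names);
      return names;
    }

    /** Fast path: reports the count without materializing any names. */
    int numTables() { return tables.size(); }
  }

  /** Counts tables by building each sorted name list and taking its length. */
  static long countViaNames(List<Db> dbs) {
    long total = 0;
    for (Db db : dbs) total += db.getTableNames().size();
    return total;
  }

  /** Counts tables by asking each database for its size directly. */
  static long countDirect(List<Db> dbs) {
    long total = 0;
    for (Db db : dbs) total += db.numTables();
    return total;
  }

  public static void main(String[] args) {
    Db db = new Db();
    for (int i = 0; i < 100_000; i++) db.addTable("t" + i);
    List<Db> dbs = Collections.singletonList(db);
    // Both paths agree on the total; only the work done to get there differs.
    System.out.println(countViaNames(dbs) == countDirect(dbs));
  }
}
```

Both methods return the same total, but countViaNames allocates and sorts a list proportional to the number of tables on every call, while countDirect only reads a size from each database.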



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
