[ https://issues.apache.org/jira/browse/IMPALA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068622#comment-17068622 ]
ASF subversion and git services commented on IMPALA-9549:
---------------------------------------------------------
Commit 1d63348b933b266f63d76b06eecbdf636cb45770 in impala's branch
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1d63348 ]
IMPALA-9549: Handle catalogd startup delays when using local catalog
Impalads should be tolerant of delays in catalogd startup.
Currently, when running with the local catalog
(use_local_catalog=true), impalad startup can fail when catalogd
startup is delayed. The ImpalaServer constructor calls
ImpalaServer::UpdateCatalogMetrics(), which maintains two metrics
counting the number of tables and databases. This call runs before
the code in ImpalaServer::Start() that waits for the catalogd to
start (added by IMPALA-4704), so there is no guarantee that catalogd
is running. UpdateCatalogMetrics() ends up calling getDbs() in the
frontend catalog. LocalCatalog::getDbs() tries to load the databases,
which requires contacting catalogd, so the call fails if catalogd is
not running. That failure causes impalad startup to fail.
use_local_catalog=false is immune to this only because it does not
contact catalogd in Catalog::getDbs().
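The difference between the two catalog modes described above can be
sketched as follows. This is a simplified illustration only, not Impala
code: CachedCatalogSketch, OnDemandCatalogSketch, and the catalogd_up flag
are hypothetical names that stand in for the legacy catalog, the local
catalog, and catalogd reachability.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of use_local_catalog=false: GetDbs() returns whatever
// the catalog currently holds and never contacts catalogd.
class CachedCatalogSketch {
 public:
  std::vector<std::string> GetDbs() const { return dbs_; }

 private:
  std::vector<std::string> dbs_;  // empty until populated elsewhere
};

// Hypothetical model of use_local_catalog=true: GetDbs() has to reach
// catalogd, so it throws when catalogd is not yet running.
class OnDemandCatalogSketch {
 public:
  explicit OnDemandCatalogSketch(const bool* catalogd_up)
      : catalogd_up_(catalogd_up) {}

  std::vector<std::string> GetDbs() const {
    if (!*catalogd_up_) {
      throw std::runtime_error("Unable to load database names");
    }
    return {"default"};  // placeholder result for the sketch
  }

 private:
  const bool* catalogd_up_;  // assumption: models catalogd reachability
};
```

In this sketch the cached variant returns an empty list before catalogd is
up, while the on-demand variant throws, which mirrors why only the local
catalog case hits the startup failure.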
This change moves the UpdateCatalogMetrics() call from the ImpalaServer
constructor to ImpalaServer::Start(), after the impalad has already
waited for the catalogd to start up. It also limits the call to
run only in coordinators.
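The reordering can be illustrated with a minimal sketch. It is not the
actual patch: FakeFrontend, SketchServer, CountDbs(), and the -1 "never
set" sentinel are hypothetical, and WaitForCatalog() here simply flips a
flag where the real call would block until catalogd responds.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the frontend. CountDbs() fails while catalogd
// is unreachable, as LocalCatalog::getDbs() does in the report.
struct FakeFrontend {
  bool catalogd_up = false;
  int num_dbs = 3;

  // Assumption: the real call blocks until catalogd is available.
  void WaitForCatalog() { catalogd_up = true; }

  int CountDbs() const {
    if (!catalogd_up) throw std::runtime_error("Unable to load database names");
    return num_dbs;
  }
};

class SketchServer {
 public:
  SketchServer(FakeFrontend* fe, bool is_coordinator)
      : fe_(fe), is_coordinator_(is_coordinator) {
    // Before the change, the metrics update ran here, ahead of any wait.
  }

  void Start() {
    if (is_coordinator_) {
      fe_->WaitForCatalog();   // wait first ...
      UpdateCatalogMetrics();  // ... then read from the catalog
    }
  }

  int num_databases_metric() const { return num_databases_; }

 private:
  void UpdateCatalogMetrics() { num_databases_ = fe_->CountDbs(); }

  FakeFrontend* fe_;
  bool is_coordinator_;
  int num_databases_ = -1;  // -1 means the metric was never set
};
```

With this ordering, a coordinator sets the metric only after the wait has
completed, and an executor never touches the catalog during startup, so its
metric stays unset.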
Prior to this change, when using local catalog, the executors would
have catalog.num-databases and catalog.num-tables set to the right
values at startup. These values would not be kept up to date.
With this change, the executors do not have these values set.
Without local catalog, both before and after this change, executors
do not have accurate counts for catalog.num-databases or
catalog.num-tables.
Testing:
- Added a test to custom_cluster.test_catalog_wait to delay catalogd
  startup by 60 seconds and verify that the impalads successfully
  start up. This test fails prior to this change.
- Hand tested to verify that the metrics that are maintained by
  UpdateCatalogMetrics() are not meaningfully changed for coordinators
  and that executors do not have metrics set.
Change-Id: I1b5a94c59faaaa25927a169dcb58f310ce6b1044
Reviewed-on: http://gerrit.cloudera.org:8080/15561
Reviewed-by: Vihang Karajgaonkar <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Impalad startup fails to wait for catalogd to startup when using local catalog
> ------------------------------------------------------------------------------
>
> Key: IMPALA-9549
> URL: https://issues.apache.org/jira/browse/IMPALA-9549
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Critical
>
> Since Impala coordinators and executors may be starting up at the same time
> as the catalogd, they should be tolerant of delays in the catalogd starting
> up. When using local catalog (use_local_catalog=true), the Impalads fail with
> the following error if the catalogd startup is delayed:
> {noformat}
> I0323 14:22:03.151849 29565 jni-util.cc:288] org.apache.impala.catalog.local.LocalCatalogException: Unable to load database names
> I0323 14:22:03.151849 29565 jni-util.cc:288] org.apache.impala.catalog.local.LocalCatalogException: Unable to load database names
>     at org.apache.impala.catalog.local.LocalCatalog.loadDbs(LocalCatalog.java:94)
>     at org.apache.impala.catalog.local.LocalCatalog.getDbs(LocalCatalog.java:83)
>     at org.apache.impala.service.Frontend.getCatalogMetrics(Frontend.java:753)
>     at org.apache.impala.service.JniFrontend.getCatalogMetrics(JniFrontend.java:220)
> Caused by: org.apache.thrift.TException: org.apache.impala.common.InternalException: Couldn't open transport for localhost:26000 (connect() failed: Connection refused)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:382)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider.access$100(CatalogdMetaProvider.java:174)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider$1.call(CatalogdMetaProvider.java:583)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider$1.call(CatalogdMetaProvider.java:578)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider.loadWithCaching(CatalogdMetaProvider.java:509)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider.loadDbList(CatalogdMetaProvider.java:577)
>     at org.apache.impala.catalog.local.LocalCatalog.loadDbs(LocalCatalog.java:92)
>     ... 3 more
> Caused by: org.apache.impala.common.InternalException: Couldn't open transport for localhost:26000 (connect() failed: Connection refused)
>     at org.apache.impala.service.FeSupport.NativeGetPartialCatalogObject(Native Method)
>     at org.apache.impala.service.FeSupport.GetPartialCatalogObject(FeSupport.java:440)
>     at org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:380)
>     ... 9 more
> I0323 14:22:03.217051 29565 status.cc:126] LocalCatalogException: Unable to load database names
> CAUSED BY: TException: org.apache.impala.common.InternalException: Couldn't open transport for localhost:26000 (connect() failed: Connection refused)
> {noformat}
> What happens is that the ImpalaServer constructor calls
> ImpalaServer::UpdateCatalogMetrics()
> ([https://github.com/apache/impala/blob/3b833902519fb8f0ef9b5fd20919c5fd85d22fcf/be/src/service/impala-server.cc#L452]).
> UpdateCatalogMetrics() maintains two metrics that track the number of
> databases and the number of tables. This ends up calling
> org.apache.impala.catalog.local.LocalCatalog.getDbs(), which calls loadDbs()
> ([https://github.com/apache/impala/blob/ca0785ec206f27f06d8d6fd1b710779e548bbd8e/fe/src/main/java/org/apache/impala/catalog/local/LocalCatalog.java#L83]).
> loadDbs() requires a connection to catalogd and will fail if it cannot
> connect.
> Importantly, this all happens before waiting for the catalogd to start up in
> the regular ImpalaServer::Start():
> {code:java}
> if (FLAGS_is_coordinator) exec_env_->frontend()->WaitForCatalog();
> {code}
>
> In the old catalog implementation (use_local_catalog=false), the getDbs()
> call on the catalog returns whatever values it has, and it does not try to
> contact the catalogd. This is why the regular case does not see this problem.