[Impala-ASF-CR] IMPALA-5476: Fix Catalogd restart bring about metadata is out of sync

Quanlong Huang (Code Review) Fri, 20 Aug 2021 00:58:03 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17645 )

Change subject: IMPALA-5476: Fix Catalogd restart bring about metadata is out
of sync
......................................................................

Patch Set 23: Code-Review+2

(7 comments)

The patch is in good shape. Let's make the descriptions and comments clear!

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@9
PS23, Line 9: The change of catalogServiceId did not trigger a full topic update
: request.
:
: During the execution of DDL or DML, restarting Catalogd will cause
: catalogServiceId to change. The DDL will execute successfully, and
: then directly update the local cache (not through Statestored),
modify
: the local catalog with the new catalogServiceId, trigger an
exception,
: print the exception information, but did not submit full topic
update
: request. The version of Catalogd is lower than Impalad, so the
metadata
: information synchronized from Catalogd will be lost.
nit: ImpaladCatalog#updateCatalog() doesn't trigger a full topic update request
when detecting catalogServiceId changes. It just updates the local
catalogServiceId and throws an exception to abort applying the DDL/DML results.
This causes a problem when catalogd is restarted and the DDL/DML is executed on
the restarted instance. In this case, only the local catalogServiceId is
updated to the latest. The local catalog remains stale. Then when dealing with
the following updates from statestore, the catalogServiceId always matches, so
updates will be applied without exceptions. However, the catalog objects
usually won't be updated since they have higher versions (from the old catalogd
instance) than those in the update. This brings the local catalog out of sync
until the catalog version of the new catalogd grows larger enough.

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@20
PS23, Line 20: Under normal circumstances, if the catalogServiceId changes,
Impalad
: will submit full topic update request, Impalad will receive
: TUpdateCatalogCacheRequest, and is_delta is false, that is full
update.
: Impalad will reset the local catalog cache, and then update, and
the
: version of CatalogD is greater than Impalad.
nit: Note that in dealing with the catalog updates from statestore, if the
catalogServiceId unmatches, impalad will request a full topic update. See more
in ImpalaServer::CatalogUpdateCallback().

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@25
PS23, Line 25:
Let's add some description about how we fix the bug:

This patch fixes this issue by checking the catalogServiceId before invoking
UpdateCatalogCache() of FE. If catalogServiceId doesn't match the one in the
DDL/DML result, wait until it changes. The following update from statestore
will change it and unblocks the DDL/DML thread.

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@28
PS23, Line 28: TestRestart#test_restart_catalogd
nit: several tests

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@29
PS23, Line 29: ests
nit: tests

http://gerrit.cloudera.org:8080/#/c/17645/23/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/17645/23/be/src/service/impala-server.cc@2146
PS23, Line 2146: // TODO Restart catalogd twice. When restarting for the
second time, ddl executes
: // successfully, but we may receive the service id of the
first restart, which will
: // cause the local cache to not be updated in time
nit: could you move these into the else-block at line 2164 and change them to

// We can't apply updates on another service id, because the local catalog is
still inconsistent with the catalogd that executes the DDL. Catalogd may be
restarted more than once inside a statestore update cycle. 'cur_service_id'
could belong to 1) a stale update from the previous restarted catalogd, or 2) a
newer update from next restarted catalogd. We are good to ignore the DDL result
at the second case. However, in the first case clients may see stale catalog
until the expected catalog topic update comes. TODO: handle the first case in
IMPALA-10875.

http://gerrit.cloudera.org:8080/#/c/17645/23/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/17645/23/tests/custom_cluster/test_restart_services.py@230
PS23, Line 230: execute_query_async
Currently execute_query() and execute_query_async() are the same for DDLs. But
could you change this to execute_query()? This may cause test failure if one
day we fix IMPALA-718. In that case, execute_query_async will return
immediately and thread.join() at line 236 doesn't guard anything.

--
To view, visit http://gerrit.cloudera.org:8080/17645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9fe25f5a2a42fb432e306ef08ae35750c8f3c50c
Gerrit-Change-Number: 17645
Gerrit-PatchSet: 23
Gerrit-Owner: liuyao <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: liuyao <[email protected]>
Gerrit-Comment-Date: Fri, 20 Aug 2021 07:57:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5476: Fix Catalogd restart bring about metadata is out of sync

Reply via email to