Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17645 )

Change subject: IMPALA-5476: Fix Catalogd restart bring about metadata is out 
of sync
......................................................................


Patch Set 23: Code-Review+2

(7 comments)

The patch is in good shape. Let's make the descriptions and comments clear!

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@9
PS23, Line 9: The change of catalogServiceId did not trigger a full topic update
            : request.
            :
            : During the execution of DDL or DML, restarting Catalogd will cause
            : catalogServiceId to change. The DDL will execute successfully, and
            : then directly update the local cache (not through Statestored), 
modify
            : the local catalog with the new catalogServiceId, trigger an 
exception,
            : print the exception information, but did not submit full topic 
update
            : request. The version of Catalogd is lower than Impalad, so the 
metadata
            : information synchronized from Catalogd will be lost.
nit: ImpaladCatalog#updateCatalog() doesn't trigger a full topic update request 
when detecting catalogServiceId changes. It just updates the local 
catalogServiceId and throws an exception to abort applying the DDL/DML results. 
This causes a problem when catalogd is restarted and the DDL/DML is executed on 
the restarted instance. In this case, only the local catalogServiceId is 
updated to the latest. The local catalog remains stale. Then when dealing with 
the following updates from statestore, the catalogServiceId always matches, so 
updates will be applied without exceptions. However, the catalog objects 
usually won't be updated since they have higher versions (from the old catalogd 
instance) than those in the update. This brings the local catalog out of sync 
until the catalog version of the new catalogd grows larger enough.


http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@20
PS23, Line 20: Under normal circumstances, if the catalogServiceId changes, 
Impalad
             : will submit full topic update request, Impalad will receive
             : TUpdateCatalogCacheRequest, and is_delta is false, that is full 
update.
             : Impalad will reset the local catalog cache, and then update, and 
the
             : version of CatalogD is greater than Impalad.
nit: Note that in dealing with the catalog updates from statestore, if the 
catalogServiceId unmatches, impalad will request a full topic update. See more 
in ImpalaServer::CatalogUpdateCallback().


http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@25
PS23, Line 25:
Let's add some description about how we fix the bug:

This patch fixes this issue by checking the catalogServiceId before invoking 
UpdateCatalogCache() of FE. If catalogServiceId doesn't match the one in the 
DDL/DML result, wait until it changes. The following update from statestore 
will change it and unblocks the DDL/DML thread.


http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@28
PS23, Line 28: TestRestart#test_restart_catalogd
nit: several tests


http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@29
PS23, Line 29: ests
nit: tests


http://gerrit.cloudera.org:8080/#/c/17645/23/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/17645/23/be/src/service/impala-server.cc@2146
PS23, Line 2146:     // TODO Restart catalogd twice. When restarting for the 
second time, ddl executes
               :     // successfully, but we may receive the service id of the 
first restart, which will
               :     // cause the local cache to not be updated in time
nit: could you move these into the else-block at line 2164 and change them to

// We can't apply updates on another service id, because the local catalog is 
still inconsistent with the catalogd that executes the DDL. Catalogd may be 
restarted more than once inside a statestore update cycle. 'cur_service_id' 
could belong to 1) a stale update from the previous restarted catalogd, or 2) a 
newer update from next restarted catalogd. We are good to ignore the DDL result 
at the second case. However, in the first case clients may see stale catalog 
until the expected catalog topic update comes. TODO: handle the first case in 
IMPALA-10875.


http://gerrit.cloudera.org:8080/#/c/17645/23/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/17645/23/tests/custom_cluster/test_restart_services.py@230
PS23, Line 230: execute_query_async
Currently execute_query() and execute_query_async() are the same for DDLs. But 
could you change this to execute_query()? This may cause test failure if one 
day we fix IMPALA-718. In that case, execute_query_async will return 
immediately and thread.join() at line 236 doesn't guard anything.



--
To view, visit http://gerrit.cloudera.org:8080/17645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9fe25f5a2a42fb432e306ef08ae35750c8f3c50c
Gerrit-Change-Number: 17645
Gerrit-PatchSet: 23
Gerrit-Owner: liuyao <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: liuyao <[email protected]>
Gerrit-Comment-Date: Fri, 20 Aug 2021 07:57:50 +0000
Gerrit-HasComments: Yes

Reply via email to