Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17645 )
Change subject: IMPALA-5476: Fix Catalogd restart bring about metadata is out of sync ...................................................................... Patch Set 23: Code-Review+2 (7 comments) The patch is in good shape. Let's make the descriptions and comments clear! http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@9 PS23, Line 9: The change of catalogServiceId did not trigger a full topic update : request. : : During the execution of DDL or DML, restarting Catalogd will cause : catalogServiceId to change. The DDL will execute successfully, and : then directly update the local cache (not through Statestored), modify : the local catalog with the new catalogServiceId, trigger an exception, : print the exception information, but did not submit full topic update : request. The version of Catalogd is lower than Impalad, so the metadata : information synchronized from Catalogd will be lost. nit: ImpaladCatalog#updateCatalog() doesn't trigger a full topic update request when detecting catalogServiceId changes. It just updates the local catalogServiceId and throws an exception to abort applying the DDL/DML results. This causes a problem when catalogd is restarted and the DDL/DML is executed on the restarted instance. In this case, only the local catalogServiceId is updated to the latest. The local catalog remains stale. Then when dealing with the following updates from statestore, the catalogServiceId always matches, so updates will be applied without exceptions. However, the catalog objects usually won't be updated since they have higher versions (from the old catalogd instance) than those in the update. This brings the local catalog out of sync until the catalog version of the new catalogd grows larger enough. http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@20 PS23, Line 20: Under normal circumstances, if the catalogServiceId changes, Impalad : will submit full topic update request, Impalad will receive : TUpdateCatalogCacheRequest, and is_delta is false, that is full update. : Impalad will reset the local catalog cache, and then update, and the : version of CatalogD is greater than Impalad. nit: Note that in dealing with the catalog updates from statestore, if the catalogServiceId unmatches, impalad will request a full topic update. See more in ImpalaServer::CatalogUpdateCallback(). http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@25 PS23, Line 25: Let's add some description about how we fix the bug: This patch fixes this issue by checking the catalogServiceId before invoking UpdateCatalogCache() of FE. If catalogServiceId doesn't match the one in the DDL/DML result, wait until it changes. The following update from statestore will change it and unblocks the DDL/DML thread. http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@28 PS23, Line 28: TestRestart#test_restart_catalogd nit: several tests http://gerrit.cloudera.org:8080/#/c/17645/23//COMMIT_MSG@29 PS23, Line 29: ests nit: tests http://gerrit.cloudera.org:8080/#/c/17645/23/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/17645/23/be/src/service/impala-server.cc@2146 PS23, Line 2146: // TODO Restart catalogd twice. When restarting for the second time, ddl executes : // successfully, but we may receive the service id of the first restart, which will : // cause the local cache to not be updated in time nit: could you move these into the else-block at line 2164 and change them to // We can't apply updates on another service id, because the local catalog is still inconsistent with the catalogd that executes the DDL. Catalogd may be restarted more than once inside a statestore update cycle. 'cur_service_id' could belong to 1) a stale update from the previous restarted catalogd, or 2) a newer update from next restarted catalogd. We are good to ignore the DDL result at the second case. However, in the first case clients may see stale catalog until the expected catalog topic update comes. TODO: handle the first case in IMPALA-10875. http://gerrit.cloudera.org:8080/#/c/17645/23/tests/custom_cluster/test_restart_services.py File tests/custom_cluster/test_restart_services.py: http://gerrit.cloudera.org:8080/#/c/17645/23/tests/custom_cluster/test_restart_services.py@230 PS23, Line 230: execute_query_async Currently execute_query() and execute_query_async() are the same for DDLs. But could you change this to execute_query()? This may cause test failure if one day we fix IMPALA-718. In that case, execute_query_async will return immediately and thread.join() at line 236 doesn't guard anything. -- To view, visit http://gerrit.cloudera.org:8080/17645 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9fe25f5a2a42fb432e306ef08ae35750c8f3c50c Gerrit-Change-Number: 17645 Gerrit-PatchSet: 23 Gerrit-Owner: liuyao <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: liuyao <[email protected]> Gerrit-Comment-Date: Fri, 20 Aug 2021 07:57:50 +0000 Gerrit-HasComments: Yes
