Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20192 )

Change subject: IMPALA-12267: DMLs/DDLs can hang as a result of catalogd restart
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/20192/1/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/20192/1/be/src/service/impala-server.cc@382
PS1, Line 382: 5
> When catalog_update_info_ really changes, we exit the loop.

Hmm, we only exit the loop when its 'catalog_service_id' changes. It also has 
other fields, like catalog_version, that could change in a statestore update, 
but we don't check those in the while-loop.

The suggestion is that we should check all fields of 'catalog_update_info_' to 
count statestore updates. If we receive e.g. 10 statestore updates and still 
see the catalog service id unchanged, we can exit the loop.
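
A rough sketch of what I mean, not the actual Impala code: UpdateCounter, 
Decision and the two-field CatalogUpdateInfo below are hypothetical names, and 
the real struct has more fields that would all take part in the comparison.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical two-field stand-in for catalog_update_info_.
struct CatalogUpdateInfo {
  std::string catalog_service_id;
  int64_t catalog_version;

  bool operator==(const CatalogUpdateInfo& other) const {
    return catalog_service_id == other.catalog_service_id
        && catalog_version == other.catalog_version;
  }
};

// Hypothetical helper that counts statestore updates seen while waiting.
class UpdateCounter {
 public:
  enum class Decision { kKeepWaiting, kServiceIdChanged, kGiveUp };

  UpdateCounter(CatalogUpdateInfo start, int max_updates)
    : start_service_id_(start.catalog_service_id),
      last_seen_(std::move(start)),
      max_updates_(max_updates) {}

  // Call with the current snapshot each time the waiter wakes up.
  Decision Observe(const CatalogUpdateInfo& current) {
    if (current.catalog_service_id != start_service_id_) {
      return Decision::kServiceIdChanged;
    }
    // Treat any field change as one new statestore update. A wake-up with
    // an identical snapshot is not counted.
    if (!(current == last_seen_)) {
      last_seen_ = current;
      ++updates_seen_;
    }
    return updates_seen_ >= max_updates_ ? Decision::kGiveUp
                                         : Decision::kKeepWaiting;
  }

  int updates_seen() const { return updates_seen_; }

 private:
  const std::string start_service_id_;
  CatalogUpdateInfo last_seen_;
  const int max_updates_;
  int updates_seen_ = 0;
};
```

The while-loop would call Observe() after each wake-up and exit on anything 
other than kKeepWaiting, with max_updates set to e.g. 10.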


http://gerrit.cloudera.org:8080/#/c/20192/1/be/src/service/impala-server.cc@2269
PS1, Line 2269: we only got the updates about some but not all restarts
              :       //        - the update about the catalogd that has 'catalog_service_id' has not
              :       //        arrived yet
> I thought about a situation like this:
Yeah, exactly. In such a case (IMPALA-10875), the client might see stale 
metadata. In theory, there are two cases in which we exit the while loop 
because the catalog service id changed: either the id in the DDL response is 
stale, or the id in the DDL response is newer. The client might see stale 
metadata in the latter case. The third case is a timeout.


http://gerrit.cloudera.org:8080/#/c/20192/1/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/20192/1/tests/custom_cluster/test_restart_services.py@234
PS1, Line 234:     assert "Ignoring catalog update result of catalog service ID" in logs
> I added "age" in the select statement on L237.
I think we should check it before the alter statement on L235, because the 
alter statement might bring the metadata back to normal. What we want to check 
is that the client doesn't see stale metadata even if the timeout happens.



--
To view, visit http://gerrit.cloudera.org:8080/20192
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib71bec8f67f80b0bdfe0a6cc46a16ef624163d8b
Gerrit-Change-Number: 20192
Gerrit-PatchSet: 2
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Wed, 19 Jul 2023 14:27:25 +0000
Gerrit-HasComments: Yes
