Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20192 )

Change subject: IMPALA-12267: DMLs/DDLs can hang as a result of catalogd restart
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20192/1/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/20192/1/be/src/service/impala-server.cc@382
PS1, Line 382: -1
> Quanlong may have more experience for safe timeout value.
I think 5 minutes should be enough but maybe don't need that long. There are 
two cases we need this timeout:
1) the case of IMPALA-12267 to avoid coordinator wait infinitely. Any timeout 
value is ok.
2) the case of MPALA-5476: the DDL response comes from the new catalogd and 
coordinator need to wait for the statestore update from it. Note that when 
catalogd launches, it sends the inital update to statestore. So any update in 
statestore should be usable. So I think 2x or 3x of 
statestore_update_frequency_ms should be ok in normal cases. In case the new 
catalogd has intermittent connection issue with statestore that causes the 
catalog update not being sent, coordinator might need to wait more time.

I think limiting the iterations might be more robust than limiting the wait 
time. Each iteration corresponds to an applied statestore update:

        while (cur_service_id == catalog_update_info_.catalog_service_id) {
          catalog_version_update_cv_.Wait(ver_lock);
        }

It seems 3 iterations should be enough. The worse case that needs 3 iterations 
is:
iter0: coordinator is applying the statestore update of the old catalogd while 
the DDL responses from the new catalogd comes and waits
iter1: coordinator starts to apply the statestore update of the new catalogd 
and sees the catalog_service_id changes. It then triggers a full update from 
statestore.
iter2: coordinator applies the full update from statestore then updates the 
catalog_service_id to the latest, which unblocks the DDL.

In case we miss some corner cases, we can choose a larger iteration limit, e.g. 
10.


http://gerrit.cloudera.org:8080/#/c/20192/1/be/src/service/impala-server.cc@2241
PS1, Line 2241:             catalog_version_update_cv_.Wait(ver_lock);
Can we add a log after each wake up? It helps to know how many iterations we 
have waited.



--
To view, visit http://gerrit.cloudera.org:8080/20192
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib71bec8f67f80b0bdfe0a6cc46a16ef624163d8b
Gerrit-Change-Number: 20192
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Mon, 17 Jul 2023 11:15:13 +0000
Gerrit-HasComments: Yes

Reply via email to