Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 9:

(2 comments)

Responded to a few comments from Thomas; still thinking through some of his 
other comments.

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@922
PS9, Line 922:           
parent_query_driver_->TryQueryRetry(parent_request_state_, &retryable_status));
Yeah, agreed, it is unfortunate, although I think the issue has always been 
there. The Coordinator class already has a parent_request_state_ that points to 
the owning ClientRequestState. The ClientRequestState follows a similar 
pattern: it has a parent_server_ that points to the owning ImpalaServer.
I'm not sure this is a major problem overall. It does make the code a bit 
confusing, but I've done my best to limit usage of the parent_query_driver_ 
pointer: TryQueryRetry is the only method ever called on it.
Yeah, it does take the coordinator lock, but only for a pretty short period of 
time, since it just checks some state. The actual work of scheduling and 
running the retried query is done in a separate thread that doesn't need to 
hold the coordinator lock. The lock should also be acquired only rarely, i.e. 
only when a query fails with a retryable error.
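
To make the locking argument above concrete, here is a rough, self-contained 
sketch of the pattern. It is not the code from the patch: Status, the 
one-argument TryQueryRetry, HandleBackendError, RunRetriedQuery, WaitForRetry 
and DemoRetry are all simplified or hypothetical stand-ins, and I'm assuming 
the call is made while the coordinator lock is held.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a query status; not the real Impala Status class.
struct Status {
  bool ok;
  bool retryable;
};

// Hypothetical, heavily simplified QueryDriver.
class QueryDriver {
 public:
  ~QueryDriver() { WaitForRetry(); }

  // Cheap state check only. The retry itself runs on a separate thread, so
  // the caller's lock is held only for the check and the thread hand-off.
  // Returns true if a retry was started.
  bool TryQueryRetry(const Status& status) {
    if (status.ok || !status.retryable) return false;
    {
      std::lock_guard<std::mutex> l(lock_);
      if (retry_started_) return false;  // retry a query at most once
      retry_started_ = true;
    }
    retry_thread_ = std::thread([this] { RunRetriedQuery(); });
    return true;
  }

  // Blocks until the retry thread (if any) has finished.
  void WaitForRetry() {
    if (retry_thread_.joinable()) retry_thread_.join();
  }

  bool retry_finished() {
    std::lock_guard<std::mutex> l(lock_);
    return retry_finished_;
  }

 private:
  // Placeholder for scheduling and running the retried query.
  void RunRetriedQuery() {
    std::lock_guard<std::mutex> l(lock_);
    retry_finished_ = true;
  }

  std::mutex lock_;
  bool retry_started_ = false;
  bool retry_finished_ = false;
  std::thread retry_thread_;
};

// Hypothetical, heavily simplified Coordinator.
class Coordinator {
 public:
  explicit Coordinator(QueryDriver* parent_query_driver)
      : parent_query_driver_(parent_query_driver) {
    assert(parent_query_driver_ != nullptr);
  }

  // Hypothetical error path: takes the coordinator lock, then calls back
  // into the driver. Returns true if a retry was started.
  bool HandleBackendError(const Status& status) {
    std::lock_guard<std::mutex> l(lock_);
    return parent_query_driver_->TryQueryRetry(status);
  }

 private:
  std::mutex lock_;
  QueryDriver* parent_query_driver_;  // not owned; only TryQueryRetry is called
};

// Exercises the sketch: only the first retryable error starts a retry.
bool DemoRetry() {
  QueryDriver driver;
  Coordinator coord(&driver);
  const Status non_retryable = {false, false};
  const Status retryable = {false, true};
  bool first = coord.HandleBackendError(non_retryable);
  bool second = coord.HandleBackendError(retryable);
  bool third = coord.HandleBackendError(retryable);
  driver.WaitForRetry();
  return !first && second && !third && driver.retry_finished();
}
```

In this sketch the coordinator lock covers only the state check and the thread 
hand-off; RunRetriedQuery never touches it.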

> Would it be possible to instead have the QueryDriver wait on the coordinator 
> to finish and then check its status and decide whether to retry then?

> One problem is the QueryDriver needs to know not just if the query hit an 
> error but if the error was something retryable, but we could do something 
> like have the coordinator remember any nodes it blacklists and expose that 
> info to the QueryDriver.

Yeah, I considered this at some point, and it might be a cleaner design; I'm 
just not sure how much refactoring it is going to require.
I can convince myself the current approach of just calling TryQueryRetry in the 
coordinator is fine as well. I think it can be thought of as a "callback" into 
the QueryDriver that gets triggered under specific conditions.
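
For comparison, the alternative Thomas describes (the QueryDriver waits for the 
coordinator to finish, then decides) might look roughly like the sketch below. 
Everything here is hypothetical and only meant to show the shape of that 
design: Finish, Wait, blacklisted_executors, ShouldRetryAfter and 
DemoWaitThenDecide are made-up names, and "failed with at least one 
blacklisted node" is just an assumed stand-in for "retryable".

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

// Hypothetical, heavily simplified Coordinator that remembers which
// executors it blacklisted instead of calling back into the driver.
class Coordinator {
 public:
  void BlacklistExecutor(const std::string& host) {
    std::lock_guard<std::mutex> l(lock_);
    blacklisted_.insert(host);
  }

  // Marks the query as finished, successfully or not.
  void Finish(bool ok) {
    {
      std::lock_guard<std::mutex> l(lock_);
      done_ = true;
      ok_ = ok;
    }
    done_cv_.notify_all();
  }

  // Blocks until Finish() has been called; returns whether the query succeeded.
  bool Wait() {
    std::unique_lock<std::mutex> l(lock_);
    done_cv_.wait(l, [this] { return done_; });
    return ok_;
  }

  // Returns a copy so the caller never holds the coordinator lock.
  std::set<std::string> blacklisted_executors() {
    std::lock_guard<std::mutex> l(lock_);
    return blacklisted_;
  }

 private:
  std::mutex lock_;
  std::condition_variable done_cv_;
  bool done_ = false;
  bool ok_ = false;
  std::set<std::string> blacklisted_;
};

// Hypothetical QueryDriver that makes the retry decision itself.
class QueryDriver {
 public:
  // Waits for the coordinator, then treats a failure as retryable only if
  // at least one executor was blacklisted (an assumption of this sketch).
  bool ShouldRetryAfter(Coordinator* coord) {
    assert(coord != nullptr);
    bool ok = coord->Wait();
    return !ok && !coord->blacklisted_executors().empty();
  }
};

// Exercises the three cases: failure with a blacklisted node, failure
// without one, and success.
bool DemoWaitThenDecide() {
  QueryDriver driver;

  Coordinator failed_with_blacklist;
  failed_with_blacklist.BlacklistExecutor("host-1:22000");
  failed_with_blacklist.Finish(false);

  Coordinator failed_plain;
  failed_plain.Finish(false);

  Coordinator succeeded;
  succeeded.Finish(true);

  return driver.ShouldRetryAfter(&failed_with_blacklist) &&
         !driver.ShouldRetryAfter(&failed_plain) &&
         !driver.ShouldRetryAfter(&succeeded);
}
```

The trade-off is that the coordinator no longer needs a pointer back to the 
QueryDriver, but the driver then has to own the wait and inspect coordinator 
state after the fact.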

If you still feel strongly about it, I can do some digging, but I would prefer 
to make the code changes in a follow-up patch, because I don't think it will be 
a straightforward or small change, and I don't want to expand the scope of this 
patch further.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@1060
PS9, Line 1060:   
ExecEnv::GetInstance()->cluster_membership_mgr()->BlacklistExecutor(
> This of course doesn't actually guarantee that the retried query won't be s
That's a good point; filed IMPALA-9636 to fix this.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 09 Apr 2020 17:57:32 +0000
Gerrit-HasComments: Yes
