Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes
......................................................................

Patch Set 9:

(2 comments)

responded to a few comments from Thomas, still thinking through some of his other comments.

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@922
PS9, Line 922: parent_query_driver_->TryQueryRetry(parent_request_state_, &retryable_status));

yeah, agree it is unfortunate, although i think the issue has always been there. the coordinator class already has a parent_request_state_ that points to the owning ClientRequestState. the ClientRequestState follows a similar pattern: it has a parent_server_ which points to the owning ImpalaServer. i'm not sure this is a major problem overall. it does make the code a bit confusing, but I've done my best to limit the usage of parent_query_driver_. only the TryQueryRetry method is ever called on the parent_query_driver_.

yeah, it does take the coordinator lock, but only for a pretty short period of time - it just checks some state. the actual work to schedule and run the retried query is done in a separate thread that doesn't require holding the coordinator lock. acquiring the lock should be rare - it only happens when a query fails with a retryable error.

> Would it be possible to instead have the QueryDriver wait on the coordinator
> to finish and then check its status and decide whether to retry then?
> One problem is the QueryDriver needs to know not just if the query hit an
> error but if the error was something retryable, but we could do something
> like have the coordinator remember any nodes it blacklists and expose that
> info to the QueryDriver.

yeah, I considered this at some point, and it might be a cleaner design, just not sure how much refactoring it is going to require.
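
For readers following the thread, the pattern described above (a back-pointer from the coordinator to its owning driver, a brief state check under the coordinator lock, and the retry itself handed off to a separate thread) might look roughly like the sketch below. This is not the code in the patch: every type name here (SketchStatus, SketchQueryDriver, SketchCoordinator, UpdateExecState) is hypothetical, the TryQueryRetry signature is simplified, and the "retry" is a placeholder that only sets a flag.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a query status; not Impala's Status class.
struct SketchStatus {
  bool ok;         // true if no error occurred
  bool retryable;  // true if the error is one a retry could fix
};

// Hypothetical stand-in for the QueryDriver that owns the coordinator.
class SketchQueryDriver {
 public:
  ~SketchQueryDriver() { Join(); }

  // Called by the coordinator while it holds its own lock, so this only
  // inspects state; the expensive retry work runs on another thread.
  // Returns true if a retry was started.
  bool TryQueryRetry(const SketchStatus& status) {
    if (status.ok || !status.retryable) return false;
    bool expected = false;
    // Assumption for this sketch: a query is retried at most once.
    if (!retry_started_.compare_exchange_strong(expected, true)) return false;
    retry_thread_ = std::thread([this] { RunRetriedQuery(); });
    return true;
  }

  // Waits for the retry thread, if one was started.
  void Join() {
    if (retry_thread_.joinable()) retry_thread_.join();
  }

  bool retry_finished() const { return retry_finished_.load(); }

 private:
  // Placeholder for rescheduling and running the query again.
  void RunRetriedQuery() { retry_finished_.store(true); }

  std::atomic<bool> retry_started_{false};
  std::atomic<bool> retry_finished_{false};
  std::thread retry_thread_;
};

// Hypothetical stand-in for the coordinator.
class SketchCoordinator {
 public:
  explicit SketchCoordinator(SketchQueryDriver* parent)
      : parent_query_driver_(parent) {
    assert(parent != nullptr);
  }

  // Records the first error seen and gives the driver a chance to retry.
  // Returns true if the driver started a retry.
  bool UpdateExecState(const SketchStatus& status) {
    std::lock_guard<std::mutex> l(lock_);
    if (status.ok || !exec_status_.ok) return false;  // only the first error counts
    exec_status_ = status;
    // The callback runs under lock_, but it returns almost immediately.
    return parent_query_driver_->TryQueryRetry(status);
  }

 private:
  std::mutex lock_;
  SketchStatus exec_status_{true, false};
  SketchQueryDriver* parent_query_driver_;  // not owned
};
```

In this sketch the lock is held only for the flag check and the thread launch, which is the property the comment above relies on. The alternative design from the quoted text would instead have the driver wait for the coordinator to finish and then inspect its final status.
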
i can convince myself the current approach of just calling TryQueryRetry in the coordinator is fine as well. I think it can be thought of as a "callback" function into the QueryDriver that gets triggered under specific conditions.

if you still feel strongly about it, i can do some digging, but i would prefer to make the code changes in a follow-up patch, because i don't think it will be a straightforward / small change to make, and i don't want to expand the scope of this patch further.

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@1060
PS9, Line 1060: ExecEnv::GetInstance()->cluster_membership_mgr()->BlacklistExecutor(
> This of course doesn't actually guarantee that the retried query won't be s
that's a good point, filed IMPALA-9636 to fix this.

--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 09 Apr 2020 17:57:32 +0000
Gerrit-HasComments: Yes
