Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes ...................................................................... Patch Set 13: (4 comments) Still going through it http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@161 PS13, Line 161: void QueryDriver::RetryQueryFromThread(const Status& error) { Is there anything that prevents the QueryDriver from being deleted while this thread is running? eg. what if the query fails and then while the retry is happening the user cancels the query? Mostly we're relying on the shared_ptr to prevent those sorts of issues, but I don't think we're holding a shared_ptr for the duration of this, since you have to just pass a bare 'this' to Thread::Create http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@178 PS13, Line 178: // Cancel the query. Is this necessary? At least in the case of Coordinator calling TryQueryRetry() we'll end up calling cancel from the Coordinator too when the status is updated to an error. I guess its fine because Cancel() is idempotent, but might be worth thinking about what we can do to make this cleaner. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@179 PS13, Line 179: true, nullptr Both of these parameters seem wrong - you note above that a query could get retried when its still in the INITIALIZED state, so seems like 'check_inflight=true' could fail, and according to the comment about ClientRequestState::Cancel 'nullptr' for cause is supposed to indicate a user-initiated cancel that doesn't have an associated error, which this isn't. I think in the case of an error that comes from the query itself (i.e. Coordinator call TryQueryRetry()) you'll get the error set anyways when UpdateExecState() is called (unless the AuxErrorInfo comes in a different rpc than the overall error message, in which case I think there's a race between the coordinator getting cancelled and no longer accepting reports, though I'm not sure), and in the case of a backend crashing (i.e. CancelFromThreadPool calls TryQueryRetry()), I don't think the error will get set. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@281 PS13, Line 281: // TODO: IMPALA-9502: Avoid copying TExecRequest when retrying queries Out of curiosity, what are the issues here? -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 13 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Thu, 30 Apr 2020 23:04:08 +0000 Gerrit-HasComments: Yes
