Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes ...................................................................... Patch Set 26: (8 comments) http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511 PS26, Line 511: if (!failed_backend_states.empty()) { Seems like it shouldn't be possible for this to ever be false. Maybe add a DCHECK? http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@512 PS26, Line 512: RETURN_IF_ERROR I don't think its possible for HandleFailedExecRpcs() to return an error, and even if it did it probably wouldn't be correct to just return like this here since we still need to call UpdateExecState() below to update the query to an error status. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@982 PS26, Line 982: parent_query_driver_->TryQueryRetry(parent_request_state_, &retryable_status); Probably fine to leave as is for now since currently any query that hits this is going to end up failing, but just wanted to point out that this code path means that we could potentially cancel the query without having actually received an error status back yet. As noted elsewhere, this makes it more difficult to use the AddDetail() on the Status out parameter in TryQueryRetry. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: parent_query_driver_->TryQueryRetry(parent_request_state_, &retryable_status); Wanted to point out that there's an unfortunate situation here and elsewhere TryQueryRetry is called where we end up with two slightly different versions for the overall query status - the one that gets set on Coordinator (in this case, exec_rpc_status, which represents the error from the first backend to fail in Exec()) and 'retryable_status' which gets set on ClientRequestState in MarkAsRetrying(). Normally, those two statuses would be the same, because ClientRequestState would get Coordinator's state returned to it from Exec(), which it would then set on itself, it won't set it anymore as we only preserve the first error message. I think that's messy and has the potential to be confusing for developers. Personally, I still really think it would be worth it to do the work to at least somewhat remove the circular dependency here, eg. you could move the call to TryQueryRetry to ClientRequestState::UpdateQueryStatus() along with defining a Coordinator::BlacklistedExecutors() or something like that, It wouldn't be much work, it would solve this problem, and it would make the control flow a lot cleaner, make Coordinator better encapsulated since it wouldn't need to know anything about retries, etc. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// query option 'retry_failed_queries') and if the query can be retried. Queries can > Doesn't return a Status anymore. Sure, 'status' is used in one code path that calls TryQueryRetry (the code path in Coordinator where we get an error in a status report, which test_multiple_retries is testing), but its not used in any of the other code paths that call TryQueryRetry so it gets lost in those cases (eg. HandleFailedExecRpcs) One option would be to just fix those code paths to actually set the status they passed in as the out parameter to the overall error status, but that's not always going to be easy (eg. the path in Coordinator where we get an AuxErrorInfo but no overall query error status in the report, so we have no error status to add the details to). Instead, maybe it would be better to just add an info string to the profile that says "This query was not retried because..." One disadvantage of that is that it means you can't immediately tell from the overall error why the query wasn't retried and have to dig down into the profile/logs, but that's already the case since you're only adding the detail if we hit the max retries and not for other cases such as if rows had already been fetched, though maybe you want to do an AddDetail for those cases too. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204 PS24, Line 204: /// 'RunFrontendPlanner'. > Do you mean that CreateRetriedClientRequestState could just pass in a point Personally, I think the code is way more confusing this way. You're not "moving" the TExecRequest from the original ClientRequestState to the new one - the TExecRequest is just owned by the QueryDriver, so you're moving it from one place in the QueryDriver to a different place in the QueryDriver. It doesn't change at any point, but if I was coming along reading this header file without the context of this patch, I would assume that since there's two separate TExecRequest variables they must be different. It also unnecessarily complicates the object lifecycle, since now which of the two unique_ptr variables is valid changes depending on where in execution we are. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884 PS24, Line 884: QueryHandle query_handle; > I think the point of the QueryHandle is to enforce that whenever users use I understand that the point of QueryHandle is to make it harder for developers to screw up and let the shared_ptr go out of scope while still using the ClientRequestState. My point is that as written it doesn't really do anything to accomplish that. Taking this function as an example - there's nothing to stop a developer from letting 'query_handle' go out of scope while still using 'request_state'. For a developer that's just looking at this method, there's nothing to even hint that's a requirement. In fact, the code is exactly identical to the code without QueryHandle except just that you replace 'shared_ptr<QueryDriver>' with QueryHandle and the declaration of 'request_state' from 'query_driver->GetRequestState()' to 'query_handle.request_state'. Again, all you've accomplished is that now instead of the rule being "you must remember to keep the shared_ptr on the stack" its "you must remember to keep the QueryHandle on the stack", so you haven't changed anything. If you used my suggestion - make 'request_state' private in QueryHandle and redirect calls on QueryHandle to it with "operator->" the same way that unique_ptr works, then for a developer to call any functions on the ClientRequestState they have to actually have the QueryHandle object itself, thus actually ensuring that the shared_ptr hasn't gone out of scope. http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py@467 PS26, Line 467: ("RETRYABLE_ERROR", 151, "Retryable error: $0"), I'm not certain this is the right thing to do. It doesn't really seem consistent with how these error codes are used - all of the other codes (with maybe the exception of RECOVERABLE_ERROR) say what caused the error, whereas 'retryable' is a property that many different errors with different causes can have. If you go this route, is there now an expectation that all errors that are potentially retryable use this? Does that prevent us from using more specific error codes to differentiate between different errors that are retryable? The main effect its having in this patch is just to prepend "Retryable error:" to some error messages, which makes this less transparent, and I'm not sure how helpful that is anyways. To find out if it was in fact retried, if not why, etc. users will generally still have to go dig around in logs/the profile. Seems like it has more potential to just confuse users than to be helpful. You also presumably don't want to return RETRYABLE_ERRORs to users that don't have retries turned on, but that means that almost identical query executions can return different error codes for the same error depending on a query option, which seems confusing. I think you may also find this harder to use without doing things like returning RETRYABLE_ERRORs to users that don't even have retries on (which would definitely be confusing) once you go address my comments about losing the AddDetail from TryQueryRetry -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 26 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Mon, 11 May 2020 22:03:59 +0000 Gerrit-HasComments: Yes
