Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes ...................................................................... Patch Set 24: (8 comments) http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511 PS26, Line 511: if (!failed_backend_states.empty()) { > Seems like it shouldn't be possible for this to ever be false. Maybe add a Done http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@512 PS26, Line 512: RETURN_IF_ERROR > I don't think its possible for HandleFailedExecRpcs() to return an error, a Done http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@982 PS26, Line 982: if (!retryable_status.ok()) { > Probably fine to leave as is for now since currently any query that hits th Responded in other comment http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: > Normally, those two statuses would be the same, because ClientRequestState > would get Coordinator's state returned to it from Exec(), which it would then > set on itself, it won't set it anymore as we only preserve the first error > message. I think that's messy and has the potential to be confusing for > developers. Not sure I follow. After this patch and if a retry is triggered, the stack is: ClientRequestState::FinishExecQueryOrDmlRequest() --> Coordinator::Exec() --> Coordinator::StartBackendExec() --> Coordinator::HandleFailedExecRpcs() --> QueryDriver::TryQueryRetry() --> ClientRequestState::MarkAsRetrying() which sets ClientRequestState::query_status_. Before this patch, if a query fails: ClientRequestState::FinishExecQueryOrDmlRequest() --> Coordinator::Exec() --> Coordinator::StartBackendExec() --> ClientRequestState::UpdateQueryStatus() which sets ClientRequestState::query_status_. Yes, the stack is more complicated. I added some documentation to clarify it. > it won't set it anymore as we only preserve the first error message Right, but that seems to be a property of UpdateQueryStatus()'s existing behavior, and its called all over the place in ClientRequestState. I agree the removing the circular dependency would be great to do, but I think it's debatable how much effort it is. At one point I had the call to TryQueryRetry inside UpdateQueryStatus, but there were a few bugs introduced when I did that. I can't remember exactly what they were, but needless to say this part of the code is sufficiently complex that even seemingly small changes can have unintended consequences. Moreover, I personally want to defer the re-factoring changes to IMPALA-9370. I'm a bit wary of investing large chunks of time re-factoring this code multiple times, given the number of follow up tasks under IMPALA-9124, I think it is hard to predict how this code will evolve over time. A lot of the re-factoring we end up doing now might just be wasted work if we end up changing half the logic. Just a thought. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: Status TryQueryRetry(ClientRequestState* client_request_state, Status* status, > 'status' is used in one code path that calls TryQueryRetry Not sure I follow. TryQueryRetry passes the status to MarkAsRetrying which sets the given status as the query_status_ of the ClientRequestState, which is what gets exposed in the runtime profile. This was actually getting verified in the tests (up until recently) because they asserted that "Retryable error" was always in the "Query Status" field. > but that's already the case since you're only adding the detail if we hit the > max retries and not for other cases such as if rows had already been fetched, > though maybe you want to do an AddDetail for those cases too. I added it for the case when rows have already been fetched. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204 PS24, Line 204: std::unique_ptr<TExecRequest> retry_exec_request_; > Personally, I think the code is way more confusing this way. Deleted it. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884 PS24, Line 884: QueryHandle query_handle; > I understand that the point of QueryHandle is to make it harder for develop I added overloads for "->" and "*". I tried my best to mirror the semantics of unique_ptr. It required quite a bit of re-factoring, and at some points some convoluted pointer manipulation, but I think it is correct. http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py@467 PS26, Line 467: ("RETRYABLE_ERROR", 151, "Retryable error: $0"), > I'm not certain this is the right thing to do. Removed. It was mainly an attempt to share a common error message prefix between the retryable errors, to make the messages more consistent, but I'm not sure it matters much. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 24 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Tue, 12 May 2020 17:02:55 +0000 Gerrit-HasComments: Yes
