[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

Sahil Takiar (Code Review) Tue, 12 May 2020 10:03:20 -0700

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )


Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 24:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511
PS26, Line 511: if (!failed_backend_states.empty()) {
> Seems like it shouldn't be possible for this to ever be false. Maybe add a
Done


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@512
PS26, Line 512: RETURN_IF_ERROR
> I don't think its possible for HandleFailedExecRpcs() to return an error, a
Done


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@982
PS26, Line 982:     if (!retryable_status.ok()) {
> Probably fine to leave as is for now since currently any query that hits th
Responded in other comment


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:
> Normally, those two statuses would be the same, because ClientRequestState 
> would get Coordinator's state returned to it from Exec(), which it would then 
> set on itself, it won't set it anymore as we only preserve the first error 
> message. I think that's messy and has the potential to be confusing for 
> developers.

Not sure I follow.

After this patch and if a retry is triggered, the stack is: 
ClientRequestState::FinishExecQueryOrDmlRequest() --> Coordinator::Exec() --> 
Coordinator::StartBackendExec() --> Coordinator::HandleFailedExecRpcs() --> 
QueryDriver::TryQueryRetry() --> ClientRequestState::MarkAsRetrying() which 
sets ClientRequestState::query_status_.

Before this patch, if a query fails: 
ClientRequestState::FinishExecQueryOrDmlRequest() --> Coordinator::Exec() --> 
Coordinator::StartBackendExec() --> ClientRequestState::UpdateQueryStatus() 
which sets ClientRequestState::query_status_.

Yes, the stack is more complicated. I added some documentation to clarify it.

> it won't set it anymore as we only preserve the first error message

Right, but that seems to be a property of UpdateQueryStatus()'s existing 
behavior, and its called all over the place in ClientRequestState.

I agree the removing the circular dependency would be great to do, but I think 
it's debatable how much effort it is. At one point I had the call to 
TryQueryRetry inside UpdateQueryStatus, but there were a few bugs introduced 
when I did that. I can't remember exactly what they were, but needless to say 
this part of the code is sufficiently complex that even seemingly small changes 
can have unintended consequences.

Moreover, I personally want to defer the re-factoring changes to IMPALA-9370. 
I'm a bit wary of investing large chunks of time re-factoring this code 
multiple times, given the number of follow up tasks under IMPALA-9124, I think 
it is hard to predict how this code will evolve over time. A lot of the 
re-factoring we end up doing now might just be wasted work if we end up 
changing half the logic. Just a thought.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136:   Status TryQueryRetry(ClientRequestState* 
client_request_state, Status* status,
> 'status' is used in one code path that calls TryQueryRetry

Not sure I follow. TryQueryRetry passes the status to MarkAsRetrying which sets 
the given status as the query_status_ of the ClientRequestState, which is what 
gets exposed in the runtime profile.

This was actually getting verified in the tests (up until recently) because 
they asserted that "Retryable error" was always in the "Query Status" field.

> but that's already the case since you're only adding the detail if we hit the 
> max retries and not for other cases such as if rows had already been fetched, 
> though maybe you want to do an AddDetail for those cases too.

I added it for the case when rows have already been fetched.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204
PS24, Line 204:   std::unique_ptr<TExecRequest> retry_exec_request_;
> Personally, I think the code is way more confusing this way.
Deleted it.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884
PS24, Line 884:   QueryHandle query_handle;
> I understand that the point of QueryHandle is to make it harder for develop
I added overloads for "->" and "*". I tried my best to mirror the semantics of 
unique_ptr. It required quite a bit of re-factoring, and at some points some 
convoluted pointer manipulation, but I think it is correct.


http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py@467
PS26, Line 467:   ("RETRYABLE_ERROR", 151, "Retryable error: $0"),
> I'm not certain this is the right thing to do.
Removed. It was mainly an attempt to share a common error message prefix 
between the retryable errors, to make the messages more consistent, but I'm not 
sure it matters much.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 24
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Tue, 12 May 2020 17:02:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

Reply via email to