[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

Thomas Tauber-Marshall (Code Review) Mon, 11 May 2020 15:05:02 -0700

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )


Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 26:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511
PS26, Line 511: if (!failed_backend_states.empty()) {
Seems like it shouldn't be possible for this to ever be false. Maybe add a 
DCHECK?


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@512
PS26, Line 512: RETURN_IF_ERROR
I don't think its possible for HandleFailedExecRpcs() to return an error, and 
even if it did it probably wouldn't be correct to just return like this here 
since we still need to call UpdateExecState() below to update the query to an 
error status.


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@982
PS26, Line 982:       
parent_query_driver_->TryQueryRetry(parent_request_state_, &retryable_status);
Probably fine to leave as is for now since currently any query that hits this 
is going to end up failing, but just wanted to point out that this code path 
means that we could potentially cancel the query without having actually 
received an error status back yet.

As noted elsewhere, this makes it more difficult to use the AddDetail() on the 
Status out parameter in TryQueryRetry.


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:   parent_query_driver_->TryQueryRetry(parent_request_state_, 
&retryable_status);
Wanted to point out that there's an unfortunate situation here and elsewhere 
TryQueryRetry is called where we end up with two slightly different versions 
for the overall query status - the one that gets set on Coordinator (in this 
case, exec_rpc_status, which represents the error from the first backend to 
fail in Exec()) and 'retryable_status' which gets set on ClientRequestState in 
MarkAsRetrying().

Normally, those two statuses would be the same, because ClientRequestState 
would get Coordinator's state returned to it from Exec(), which it would then 
set on itself, it won't set it anymore as we only preserve the first error 
message. I think that's messy and has the potential to be confusing for 
developers.

Personally, I still really think it would be worth it to do the work to at 
least somewhat remove the circular dependency here, eg. you could move the call 
to TryQueryRetry to ClientRequestState::UpdateQueryStatus() along with defining 
a Coordinator::BlacklistedExecutors() or something like that, It wouldn't be 
much work, it would solve this problem, and it would make the control flow a 
lot cleaner, make Coordinator better encapsulated since it wouldn't need to 
know anything about retries, etc.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136:   /// query option 'retry_failed_queries') and if the query can 
be retried. Queries can
> Doesn't return a Status anymore.
Sure, 'status' is used in one code path that calls TryQueryRetry (the code path 
in Coordinator where we get an error in a status report, which 
test_multiple_retries is testing), but its not used in any of the other code 
paths that call TryQueryRetry so it gets lost in those cases (eg. 
HandleFailedExecRpcs)

One option would be to just fix those code paths to actually set the status 
they passed in as the out parameter to the overall error status, but that's not 
always going to be easy (eg. the path in Coordinator where we get an 
AuxErrorInfo but no overall query error status in the report, so we have no 
error status to add the details to).

Instead, maybe it would be better to just add an info string to the profile 
that says "This query was not retried because..."

One disadvantage of that is that it means you can't immediately tell from the 
overall error why the query wasn't retried and have to dig down into the 
profile/logs, but that's already the case since you're only adding the detail 
if we hit the max retries and not for other cases such as if rows had already 
been fetched, though maybe you want to do an AddDetail for those cases too.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204
PS24, Line 204:   /// 'RunFrontendPlanner'.
> Do you mean that CreateRetriedClientRequestState could just pass in a point
Personally, I think the code is way more confusing this way.

You're not "moving" the TExecRequest from the original ClientRequestState to 
the new one - the TExecRequest is just owned by the QueryDriver, so you're 
moving it from one place in the QueryDriver to a different place in the 
QueryDriver.

It doesn't change at any point, but if I was coming along reading this header 
file without the context of this patch, I would assume that since there's two 
separate TExecRequest variables they must be different.

It also unnecessarily complicates the object lifecycle, since now which of the 
two unique_ptr variables is valid changes depending on where in execution we 
are.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884
PS24, Line 884:   QueryHandle query_handle;
> I think the point of the QueryHandle is to enforce that whenever users use
I understand that the point of QueryHandle is to make it harder for developers 
to screw up and let the shared_ptr go out of scope while still using the 
ClientRequestState.

My point is that as written it doesn't really do anything to accomplish that.

Taking this function as an example - there's nothing to stop a developer from 
letting 'query_handle' go out of scope while still using 'request_state'. For a 
developer that's just looking at this method, there's nothing to even hint 
that's a requirement.

In fact, the code is exactly identical to the code without QueryHandle except 
just that you replace 'shared_ptr<QueryDriver>' with QueryHandle and the 
declaration of 'request_state' from 'query_driver->GetRequestState()' to 
'query_handle.request_state'. Again, all you've accomplished is that now 
instead of the rule being "you must remember to keep the shared_ptr on the 
stack" its "you must remember to keep the QueryHandle on the stack", so you 
haven't changed anything.

If you used my suggestion - make 'request_state' private in QueryHandle and 
redirect calls on QueryHandle to it with "operator->" the same way that 
unique_ptr works, then for a developer to call any functions on the 
ClientRequestState they have to actually have the QueryHandle object itself, 
thus actually ensuring that the shared_ptr hasn't gone out of scope.


http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/14824/26/common/thrift/generate_error_codes.py@467
PS26, Line 467:   ("RETRYABLE_ERROR", 151, "Retryable error: $0"),
I'm not certain this is the right thing to do.

It doesn't really seem consistent with how these error codes are used - all of 
the other codes (with maybe the exception of RECOVERABLE_ERROR) say what caused 
the error, whereas 'retryable' is a property that many different errors with 
different causes can have.

If you go this route, is there now an expectation that all errors that are 
potentially retryable use this? Does that prevent us from using more specific 
error codes to differentiate between different errors that are retryable?

The main effect its having in this patch is just to prepend "Retryable error:" 
to some error messages, which makes this less transparent, and I'm not sure how 
helpful that is anyways. To find out if it was in fact retried, if not why, 
etc. users will generally still have to go dig around in logs/the profile. 
Seems like it has more potential to just confuse users than to be helpful.

You also presumably don't want to return RETRYABLE_ERRORs to users that don't 
have retries turned on, but that means that almost identical query executions 
can return different error codes for the same error depending on a query 
option, which seems confusing.

I think you may also find this harder to use without doing things like 
returning RETRYABLE_ERRORs to users that don't even have retries on (which 
would definitely be confusing) once you go address my comments about losing the 
AddDetail from TryQueryRetry



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 26
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Mon, 11 May 2020 22:03:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

Reply via email to