Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 24:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc@937
PS24, Line 937: RETURN_IF_ERROR(
This isn't valid. (and you have similar issues elsewhere)

As far as I can tell, the only way that this returns is if we decide to retry 
the query and then Thread::Create returns an error, but even in that case we 
need to still finish this function and eg. call UpdateExecState below with the 
original error status.

In general, errors related to failing to retry aren't ever going to be overall 
query errors, and we'll always need to finish whatever processing we were doing 
for the original query, so probably all we need to do is log any errors from 
TryQueryRetry and move on. It may even be best to log the errors from within 
TryQueryRetry and not have it even return anything.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@1
PS24, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
The distinction isn't that clear, but it might make more sense to put this in 
/be/src/service, since right now its:

impala-server (src/service) -> query-driver (src/runtime) -> 
client-request-state (src/service) -> coordinator (src/runtime)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@39
PS24, Line 39: QueryHandle
I think this name is confusing, esp. since there's already a QueryHandle in 
beeswax. Its really more of a ClientRequestStateHandle, though that's wordy. 
Maybe CRSHandle? ClientRequestHandle?

It might even be worth renaming ClientRequestState, since if anything the 
QueryDriver is what really holds the current state of the client request. Maybe 
QueryInstance? Not a huge deal though


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@82
PS24, Line 82: /// *Transparent* Query Retries:
Not a huge deal in this patch since it's still hidden behind a flag and 
experimental, but I think that we should be cautious about saying that this is 
completely transparent, since it does change what clients see, potentially in 
ways that could break existing clients (eg. if you call GetRuntimeProfile() and 
then do an assert that the returned profile has the same query_id as what you 
requested, turning this on could cause that assert to fail).

Obviously at a minimum we need to test this with all the usual known clients 
(impala-shell, impyla, jdbc/odbc, Hue, etc.), and we'll probably be making 
changes to at least some of those clients for the sake of observability around 
this (eg. impala-shell might want to print a message about the query getting 
retried)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136:   Status TryQueryRetry(ClientRequestState* 
client_request_state, Status* status,
The way you're both returning a Status and also returning some status 
information in a Status out parameter seems confusing, and in fact it looks 
like at most of the call sites of this function the Status out parameter is 
never actually referenced again after the call so any info added to it is 
usually dropped.

I think its probably better to just not have 'status' be an out parameter and 
return use the returned status for everything, or possibly not return any 
status info at all (see my other comments)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@189
PS24, Line 189: A shared_ptr is used to allow asynchronous deletion
?


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@194
PS24, Line 194: A shared_ptr is used to allow
              :   /// asynchronous deletion.
?


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204
PS24, Line 204:   std::unique_ptr<TExecRequest> retry_exec_request_;
Now that this is just move()-ed from 'exec_request_' I'm not sure why its still 
needed.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@97
PS24, Line 97:     if (!client_request_state->fetched_rows()) {
This function would be more readable if you turned these into:

if (reason_not_to_retry) {
LOG << reason;
return;
}
if (another_reason) {
...

(after doing that, I personally would get rid of RetryAsync() as 1) its only 
called in one place 2) a lot of its code is just DCHECK-ing things that the ifs 
here enforce, so that could just be removed and 3) the names TryRetryQuery, 
RetryAsync, and RetryQueryFromThread get kind of confusing, but up to you)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@302
PS24, Line 302: make_unique
Instead of this make_unique/move() thing, why not just have SetOriginalId() 
take a 'const TUniqueId&' and set the unique_ptr itself?


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884
PS24, Line 884:   QueryHandle query_handle;
I'm not sure that in its current state QueryHandle is really doing much. This 
is all almost exactly equivalent to if this was just a bare 
shared_ptr<QueryDriver>, you've just moved the requirement from "developers 
must remember to keep the shared_ptr<QueryDriver> on the stack while using the 
ClientRequestState*" to "developers must remember to keep the QueryHandle on 
the stack...".

What if you did something like made the ClientRequestState* in QueryHandle 
private, and then implemented "operator->" for QueryHandle to effectively have 
it redirect any calls to the ClientRequestState*?

Would also be great to find a way to eliminate the 
GetClientRequestState/GetActiveClientRequestState functions entirely, to really 
force use of QueryHandle, but might be difficult without introducing an extra 
copy (and therefore atomic ref count inc/dec) of the shared_ptr


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-http-handler.cc
File be/src/service/impala-http-handler.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-http-handler.cc@809
PS24, Line 809:   shared_ptr<QueryDriver> query_driver = 
server_->GetQueryDriver(query_id);
Noted elsewhere, but I think if you added an 'active' flag to GetQueryHandle() 
or something you could handle situations like these with QueryHandle as well.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.h@932
PS24, Line 932:   Status GetQueryHandle(
Might be worth naming this GetActiveQueryHandle() for clarity.

Would also be good to define a regular GetQueryHandle() and/or add an 'active' 
flag to this function - I think if you did that, you could get rid of most of 
the remaining calls to GetQueryDriver() and use QueryHandle for everything.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1160
PS24, Line 1160:   
RETURN_IF_ERROR(query_handle.query_driver->StartUnregister());
I'm not sure why this is necessary. I guess it has the effect of making 
UnregisterQuery() idempotent and thread safe, but wouldn't that already need to 
be the case, eg. because you could have a client that calls 
CloseImpalaOperation() twice concurrently? (maybe it wasn't?)

It also has the effect of making it look to clients like the query no longer 
exists synchronously with Unregister() instead of having to wait for 
FinishUnregisterQuery(), but I'm not sure why that would be needed for this 
patch if the previous behavior was correct (maybe it wasn't?)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1167
PS24, Line 1167:   unreg_thread_pool_->Offer(query_handle);
I think you'll want to still call move() in situations like this - otherwise 
you're copying the shared_ptr<QueryDriver>, which results in an atomic inc/dec 
to the ref counter.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1586
PS24, Line 1586:         UnregisterQueryDiscardResult(query_id, true, 
&retry_status);
I'm wondering why we unregister the query in this case, but only cancel it in 
the other case where we decided not to retry. (I think we don't want to 
unregister here because we want the query to still be inspectable by the 
client, since we're the ones that decided to cancel it)



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 24
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 07 May 2020 21:25:43 +0000
Gerrit-HasComments: Yes

Reply via email to