Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 13:

(4 comments)

Still going through it

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@161
PS13, Line 161: void QueryDriver::RetryQueryFromThread(const Status& error) {
Is there anything that prevents the QueryDriver from being deleted while this 
thread is running? eg. what if the query fails and then while the retry is 
happening the user cancels the query? Mostly we're relying on the shared_ptr to 
prevent those sorts of issues, but I don't think we're holding a shared_ptr for 
the duration of this, since you have to just pass a bare 'this' to 
Thread::Create


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@178
PS13, Line 178:   // Cancel the query.
Is this necessary? At least in the case of Coordinator calling TryQueryRetry() 
we'll end up calling cancel from the Coordinator too when the status is updated 
to an error. I guess its fine because Cancel() is idempotent, but might be 
worth thinking about what we can do to make this cleaner.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@179
PS13, Line 179: true, nullptr
Both of these parameters seem wrong - you note above that a query could get 
retried when its still in the INITIALIZED state, so seems like 
'check_inflight=true' could fail, and according to the comment about 
ClientRequestState::Cancel 'nullptr' for cause is supposed to indicate a 
user-initiated cancel that doesn't have an associated error, which this isn't.

I think in the case of an error that comes from the query itself (i.e. 
Coordinator call TryQueryRetry()) you'll get the error set anyways when 
UpdateExecState() is called (unless the AuxErrorInfo comes in a different rpc 
than the overall error message, in which case I think there's a race between 
the coordinator getting cancelled and no longer accepting reports, though I'm 
not sure), and in the case of a backend crashing (i.e. CancelFromThreadPool 
calls TryQueryRetry()), I don't think the error will get set.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@281
PS13, Line 281:   // TODO: IMPALA-9502: Avoid copying TExecRequest when 
retrying queries
Out of curiosity, what are the issues here?



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 30 Apr 2020 23:04:08 +0000
Gerrit-HasComments: Yes

Reply via email to