Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes ...................................................................... Patch Set 24: (16 comments) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc@937 PS24, Line 937: RETURN_IF_ERROR( This isn't valid. (and you have similar issues elsewhere) As far as I can tell, the only way that this returns is if we decide to retry the query and then Thread::Create returns an error, but even in that case we need to still finish this function and eg. call UpdateExecState below with the original error status. In general, errors related to failing to retry aren't ever going to be overall query errors, and we'll always need to finish whatever processing we were doing for the original query, so probably all we need to do is log any errors from TryQueryRetry and move on. It may even be best to log the errors from within TryQueryRetry and not have it even return anything. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@1 PS24, Line 1: // Licensed to the Apache Software Foundation (ASF) under one The distinction isn't that clear, but it might make more sense to put this in /be/src/service, since right now its: impala-server (src/service) -> query-driver (src/runtime) -> client-request-state (src/service) -> coordinator (src/runtime) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@39 PS24, Line 39: QueryHandle I think this name is confusing, esp. since there's already a QueryHandle in beeswax. Its really more of a ClientRequestStateHandle, though that's wordy. Maybe CRSHandle? ClientRequestHandle? It might even be worth renaming ClientRequestState, since if anything the QueryDriver is what really holds the current state of the client request. Maybe QueryInstance? Not a huge deal though http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@82 PS24, Line 82: /// *Transparent* Query Retries: Not a huge deal in this patch since it's still hidden behind a flag and experimental, but I think that we should be cautious about saying that this is completely transparent, since it does change what clients see, potentially in ways that could break existing clients (eg. if you call GetRuntimeProfile() and then do an assert that the returned profile has the same query_id as what you requested, turning this on could cause that assert to fail). Obviously at a minimum we need to test this with all the usual known clients (impala-shell, impyla, jdbc/odbc, Hue, etc.), and we'll probably be making changes to at least some of those clients for the sake of observability around this (eg. impala-shell might want to print a message about the query getting retried) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: Status TryQueryRetry(ClientRequestState* client_request_state, Status* status, The way you're both returning a Status and also returning some status information in a Status out parameter seems confusing, and in fact it looks like at most of the call sites of this function the Status out parameter is never actually referenced again after the call so any info added to it is usually dropped. I think its probably better to just not have 'status' be an out parameter and return use the returned status for everything, or possibly not return any status info at all (see my other comments) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@189 PS24, Line 189: A shared_ptr is used to allow asynchronous deletion ? http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@194 PS24, Line 194: A shared_ptr is used to allow : /// asynchronous deletion. ? http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204 PS24, Line 204: std::unique_ptr<TExecRequest> retry_exec_request_; Now that this is just move()-ed from 'exec_request_' I'm not sure why its still needed. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@97 PS24, Line 97: if (!client_request_state->fetched_rows()) { This function would be more readable if you turned these into: if (reason_not_to_retry) { LOG << reason; return; } if (another_reason) { ... (after doing that, I personally would get rid of RetryAsync() as 1) its only called in one place 2) a lot of its code is just DCHECK-ing things that the ifs here enforce, so that could just be removed and 3) the names TryRetryQuery, RetryAsync, and RetryQueryFromThread get kind of confusing, but up to you) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@302 PS24, Line 302: make_unique Instead of this make_unique/move() thing, why not just have SetOriginalId() take a 'const TUniqueId&' and set the unique_ptr itself? http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884 PS24, Line 884: QueryHandle query_handle; I'm not sure that in its current state QueryHandle is really doing much. This is all almost exactly equivalent to if this was just a bare shared_ptr<QueryDriver>, you've just moved the requirement from "developers must remember to keep the shared_ptr<QueryDriver> on the stack while using the ClientRequestState*" to "developers must remember to keep the QueryHandle on the stack...". What if you did something like made the ClientRequestState* in QueryHandle private, and then implemented "operator->" for QueryHandle to effectively have it redirect any calls to the ClientRequestState*? Would also be great to find a way to eliminate the GetClientRequestState/GetActiveClientRequestState functions entirely, to really force use of QueryHandle, but might be difficult without introducing an extra copy (and therefore atomic ref count inc/dec) of the shared_ptr http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-http-handler.cc File be/src/service/impala-http-handler.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-http-handler.cc@809 PS24, Line 809: shared_ptr<QueryDriver> query_driver = server_->GetQueryDriver(query_id); Noted elsewhere, but I think if you added an 'active' flag to GetQueryHandle() or something you could handle situations like these with QueryHandle as well. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.h@932 PS24, Line 932: Status GetQueryHandle( Might be worth naming this GetActiveQueryHandle() for clarity. Would also be good to define a regular GetQueryHandle() and/or add an 'active' flag to this function - I think if you did that, you could get rid of most of the remaining calls to GetQueryDriver() and use QueryHandle for everything. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1160 PS24, Line 1160: RETURN_IF_ERROR(query_handle.query_driver->StartUnregister()); I'm not sure why this is necessary. I guess it has the effect of making UnregisterQuery() idempotent and thread safe, but wouldn't that already need to be the case, eg. because you could have a client that calls CloseImpalaOperation() twice concurrently? (maybe it wasn't?) It also has the effect of making it look to clients like the query no longer exists synchronously with Unregister() instead of having to wait for FinishUnregisterQuery(), but I'm not sure why that would be needed for this patch if the previous behavior was correct (maybe it wasn't?) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1167 PS24, Line 1167: unreg_thread_pool_->Offer(query_handle); I think you'll want to still call move() in situations like this - otherwise you're copying the shared_ptr<QueryDriver>, which results in an atomic inc/dec to the ref counter. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1586 PS24, Line 1586: UnregisterQueryDiscardResult(query_id, true, &retry_status); I'm wondering why we unregister the query in this case, but only cancel it in the other case where we decided not to retry. (I think we don't want to unregister here because we want the query to still be inspectable by the client, since we're the ones that decided to cancel it) -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 24 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Thu, 07 May 2020 21:25:43 +0000 Gerrit-HasComments: Yes
