Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes ...................................................................... Patch Set 27: (25 comments) > I know I've asked for a lot, but I think the patch is a lot better now. Not a problem, thanks for all the thorough feedback. Don't think this patch would be nearly as clean without all the comments. > I still have a lot of concerns about this feature, esp. about our story > around observability, but I think those concerns can mostly be addressed in > follow up work and most of my remaining comments are more minor things. Right. I don't think the expectation should be that this feature is complete by any means. I'm not sure how feature development has been done historically in Impala, but IMO I think features need to be built piece-meal over multiple-commits, before they can really be used. I won't rant too long about this, but I just think that getting an initial version that users can at least play around with gives us a level of feedback and testing that can't be done within a scoped design review session / single code review. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: > I don't think that's true - ClientRequestState::UpdateQueryStatus() is never > called from inside Coordinator. I guess, a '-->' wasn't the correct way to describe this. StartBackendExec() returns an error to Exec() which returns an error to FinishExecQueryOrDmlRequest() which then calls UpdateQueryStatus(). > Now, MarkAsRetrying() will set an error on ClientRequestState but after that > Coordinator::Exec() will still return an error which ClientRequestState will > try to set on itself with UpdateQueryStatus() but it'll get discarded because > an error was already set. Right. > The point is, its messy and I'm concerned it will be error prone. I agree that it is more complex. I'm not sure I understand why it is particularly more error prone. It's not that I don't want to take your suggestion, I just think at this point it is better to address this separately. My hesitation is that doing this re-factoring now is going to introduce bugs into the patch unnecessarily. From experience when first iterating on this patch, doing seemingly harmless re-factoring has introduced a lot of bugs. So I'm not really convinced it is worth the effort at this point in time. Taking a step back, I think at this point in the feature lifecyle, its not worth investing a huge amount of time on this. The feature itself it still immature, and subject to change. It's hard to predict exactly what it will look like after all the planned enhancement, so a lot of this debate might end up being a moot point. I am taking note of this feedback though, and plan to track it in IMPALA-9370. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910 PS27, Line 910: Status status = Status::OK(); > I don't think declaring this here is necessary since its not used outside o Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58 PS27, Line 58: std::shared_ptr<QueryDriver> > I think you might want to make this a '&' to avoid unnecessary copies isn't the point here that the pointer returned by query_driver() should be able to out-live the QueryHandle instance? in which case you would need to take a shared_ptr copy to ensure proper ref counting? at least, I thought that was part of the argument for creating the whole "->" and "*" operator overloading. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93 PS27, Line 93: TryRetryQuery > TryQueryRetry Done http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// > Right, but MarkAsRetrying doesn't get called in the case where you call > AddDetail() because TryQueryRetry returns immediately after AddDetail() Whoops, yeah you are right. I think in that case the error still gets propagated to the the ClientRequestState::query_status_ though, which is why the errors show up in the runtime profile. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228 PS27, Line 228: // Run the new query. > I think its fine to leave for now, but just wanted to note that the duplica Yeah - fair point. I tried to do that as much as I could. Most of the calls below are just one off calls to the ClientRequestState - e.g. there isn't that much control flow (or at least I tried to minimize the amount of control flow logic). The other big difference is that the error handling here vs. elsewhere is very different. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@214 PS27, Line 214: Status InitExecRequest(const TQueryCtx& query_ctx); > This is no longer used anywhere Thanks for catching that. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@222 PS27, Line 222: std::string ExecStateToString(ExecState state) const; > static? Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@225 PS27, Line 225: std::string RetryStateToString(RetryState state) const; > static? Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@228 PS27, Line 228: std::shared_ptr<ImpalaServer::SessionState> session() const { return session_; } > I think this copies the shared_ptr, which adds another atomic inc/dec. Is t I think the session can get expired / deleted in various places in ImpalaServer - e.g. an explicit call to close the session, or the session timeout thread. So I think the shared_ptr is necessary for async deletion of the session. At least, that seems to be the pattern with other usages of the session. I re-factored TryQueryRetry a bit so that it only calls session() once, which should decrease the number of times it has to acquire the shared_ptr. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@556 PS27, Line 556: Updated by : /// UpdateQueryStatus > No longer true, due to MarkAsRetrying Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@561 PS27, Line 561: initialized in InitExecRequest > No longer true. Also maybe note that the QueryDriver owns this. Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc@420 PS27, Line 420: Don't use : // GetActiveQueryHandle because clients can request profiles for unregistered queries. > Not sure what this is supposed to mean Deleted, I think that comment was from before GetActiveQueryHandle existed, it probably used to be GetQueryHandle and :%s/GetQueryHandle/GetActiveQueryHandle/g inadvertently changed it. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@670 PS27, Line 670: query_driver > query_handle Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@868 PS27, Line 868: /// query_state = union(beeswax::QueryState, ClientRequestState::RetryState). > Probably requires more explanation, eg. "used for the http server..." Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@985 PS27, Line 985: 'request_state' > needs to be updated Changed to the more vague 'exec state', which is consistent with the phrasing of the Beeswax version of this method. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@1024 PS27, Line 1024: ClientRequestState > needs to be updated Done http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1160 PS24, Line 1160: // unregistration work. StartUnregister() succeeds for the first thread to call it to > Ah okay, I hadn't noticed the equivalent started_finalize thing in ClientRe Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@943 PS27, Line 943: (*query_handle).EmplaceQueryDriver(this); > Similar to noted below, you could clean this up by adding a static QueryDri Responded in other comment. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177 PS27, Line 1177: &* > I think you could just pass a const QueryHandle& here and it would be clean CloseClientRequestState requires a bunch of non-const access - e.g. it passes the CRS into GetExecSummary which needs to call request_state->summary_profile()->SetTExecSummary(t_exec_summary); (which is a non-const function). http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1297 PS27, Line 1297: shared_ptr<QueryDriver> > So I believe this copies the shared_ptr, which results in an extra atomic i Not sure I fully follow your suggestion. I made the following changes, which I think make things a bit cleaner. Added the method SetHandle(...) in QueryHandle, which sets both the QueryDriver and CRS fields. Added the static method CreateQueryDriver(...) in QueryDriver. I made QueryDriver a friend of QueryHandle, but I couldn't get rid of EmplaceQueryDriver. C++ wouldn't let me use private member variables of QueryHandle inside a static method in QueryDriver. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1504 PS27, Line 1504: QueryHandle handle; > I think this is unnecessary and adds another shared_ptr copy. You can just Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@2411 PS27, Line 2411: query_handle > request_state? Done http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@92 PS27, Line 92: def test_kill_impalad_expect_retry(self): > I don't think I see a test case in here for if the retry is attempted due t This was sort of implicitly tested in some other tests, but I added an explicit test for Exec() RPCs, and add some additional validation for that specific case. http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@363 PS27, Line 363: def test_retry_query_hs2(self): > Fwiw, since beeswax is now deprecated and pretty much all clients use hs2, Yeah, I thought about this while writing these tests. My goal was to make sure that all code paths are covered, which I think this test suite is still achieving. Most of the query retry logic is agnostic to the specific protocol used, so I think whether it is beeswax or hs2, most of the time it shouldn't make a big difference, because the code paths covered will be the same. I think this test should cover all the code paths that are unique to HS2. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 27 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Tue, 12 May 2020 23:01:40 +0000 Gerrit-HasComments: Yes
