Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 27:

(25 comments)

> I know I've asked for a lot, but I think the patch is a lot better now.

Not a problem, thanks for all the thorough feedback. Don't think this patch 
would be nearly as clean without all the comments.

> I still have a lot of concerns about this feature, esp. about our story 
> around observability, but I think those concerns can mostly be addressed in 
> follow up work and most of my remaining comments are more minor things.

Right. I don't think the expectation should be that this feature is complete by 
any means. I'm not sure how feature development has been done historically in 
Impala, but IMO I think features need to be built piece-meal over 
multiple-commits, before they can really be used. I won't rant too long about 
this, but I just think that getting an initial version that users can at least 
play around with gives us a level of feedback and testing that can't be done 
within a scoped design review session / single code review.

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:
> I don't think that's true - ClientRequestState::UpdateQueryStatus() is never 
> called from inside Coordinator.

I guess, a '-->' wasn't the correct way to describe this. StartBackendExec() 
returns an error to Exec() which returns an error to 
FinishExecQueryOrDmlRequest() which then calls UpdateQueryStatus().

> Now, MarkAsRetrying() will set an error on ClientRequestState but after that 
> Coordinator::Exec() will still return an error which ClientRequestState will 
> try to set on itself with UpdateQueryStatus() but it'll get discarded because 
> an error was already set.

Right.

> The point is, its messy and I'm concerned it will be error prone.

I agree that it is more complex. I'm not sure I understand why it is 
particularly more error prone.

It's not that I don't want to take your suggestion, I just think at this point 
it is better to address this separately. My hesitation is that doing this 
re-factoring now is going to introduce bugs into the patch unnecessarily. From 
experience when first iterating on this patch, doing seemingly harmless 
re-factoring has introduced a lot of bugs. So I'm not really convinced it is 
worth the effort at this point in time.

Taking a step back, I think at this point in the feature lifecyle, its not 
worth investing a huge amount of time on this. The feature itself it still 
immature, and subject to change. It's hard to predict exactly what it will look 
like after all the planned enhancement, so a lot of this debate might end up 
being a moot point.

I am taking note of this feedback though, and plan to track it in IMPALA-9370.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910
PS27, Line 910:   Status status = Status::OK();
> I don't think declaring this here is necessary since its not used outside o
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58
PS27, Line 58: std::shared_ptr<QueryDriver>
> I think you might want to make this a '&' to avoid unnecessary copies
isn't the point here that the pointer returned by query_driver() should be able 
to out-live the QueryHandle instance? in which case you would need to take a 
shared_ptr copy to ensure proper ref counting?

at least, I thought that was part of the argument for creating the whole "->" 
and "*" operator overloading.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93
PS27, Line 93: TryRetryQuery
> TryQueryRetry
Done


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136: ///
> Right, but MarkAsRetrying doesn't get called in the case where you call 
> AddDetail() because TryQueryRetry returns immediately after AddDetail()

Whoops, yeah you are right. I think in that case the error still gets 
propagated to the the ClientRequestState::query_status_ though, which is why 
the errors show up in the runtime profile.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228
PS27, Line 228:   // Run the new query.
> I think its fine to leave for now, but just wanted to note that the duplica
Yeah - fair point. I tried to do that as much as I could. Most of the calls 
below are just one off calls to the ClientRequestState - e.g. there isn't that 
much control flow (or at least I tried to minimize the amount of control flow 
logic).

The other big difference is that the error handling here vs. elsewhere is very 
different.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@214
PS27, Line 214:   Status InitExecRequest(const TQueryCtx& query_ctx);
> This is no longer used anywhere
Thanks for catching that.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@222
PS27, Line 222:   std::string ExecStateToString(ExecState state) const;
> static?
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@225
PS27, Line 225:   std::string RetryStateToString(RetryState state) const;
> static?
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@228
PS27, Line 228:   std::shared_ptr<ImpalaServer::SessionState> session() const { 
return session_; }
> I think this copies the shared_ptr, which adds another atomic inc/dec. Is t
I think the session can get expired / deleted in various places in ImpalaServer 
- e.g. an explicit call to close the session, or the session timeout thread. So 
I think the shared_ptr is necessary for async deletion of the session. At 
least, that seems to be the pattern with other usages of the session.

I re-factored TryQueryRetry a bit so that it only calls session() once, which 
should decrease the number of times it has to acquire the shared_ptr.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@556
PS27, Line 556: Updated by
              :   /// UpdateQueryStatus
> No longer true, due to MarkAsRetrying
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@561
PS27, Line 561: initialized in InitExecRequest
> No longer true. Also maybe note that the QueryDriver owns this.
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc@420
PS27, Line 420: Don't use
              :   // GetActiveQueryHandle because clients can request profiles 
for unregistered queries.
> Not sure what this is supposed to mean
Deleted, I think that comment was from before GetActiveQueryHandle existed, it 
probably used to be GetQueryHandle and 
:%s/GetQueryHandle/GetActiveQueryHandle/g inadvertently changed it.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@670
PS27, Line 670: query_driver
> query_handle
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@868
PS27, Line 868:     /// query_state = union(beeswax::QueryState, 
ClientRequestState::RetryState).
> Probably requires more explanation, eg. "used for the http server..."
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@985
PS27, Line 985: 'request_state'
> needs to be updated
Changed to the more vague 'exec state', which is consistent with the phrasing 
of the Beeswax version of this method.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@1024
PS27, Line 1024: ClientRequestState
> needs to be updated
Done


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1160
PS24, Line 1160:   // unregistration work. StartUnregister() succeeds for the 
first thread to call it to
> Ah okay, I hadn't noticed the equivalent started_finalize thing in ClientRe
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@943
PS27, Line 943:   (*query_handle).EmplaceQueryDriver(this);
> Similar to noted below, you could clean this up by adding a static QueryDri
Responded in other comment.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177
PS27, Line 1177: &*
> I think you could just pass a const QueryHandle& here and it would be clean
CloseClientRequestState requires a bunch of non-const access - e.g. it passes 
the CRS into GetExecSummary which needs to call 
request_state->summary_profile()->SetTExecSummary(t_exec_summary); (which is a 
non-const function).


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1297
PS27, Line 1297: shared_ptr<QueryDriver>
> So I believe this copies the shared_ptr, which results in an extra atomic i
Not sure I fully follow your suggestion. I made the following changes, which I 
think make things a bit cleaner.

Added the method SetHandle(...) in QueryHandle, which sets both the QueryDriver 
and CRS fields.

Added the static method CreateQueryDriver(...) in QueryDriver.

I made QueryDriver a friend of QueryHandle, but I couldn't get rid of 
EmplaceQueryDriver. C++ wouldn't let me use private member variables of 
QueryHandle inside a static method in QueryDriver.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1504
PS27, Line 1504: QueryHandle handle;
> I think this is unnecessary and adds another shared_ptr copy. You can just
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@2411
PS27, Line 2411: query_handle
> request_state?
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@92
PS27, Line 92:   def test_kill_impalad_expect_retry(self):
> I don't think I see a test case in here for if the retry is attempted due t
This was sort of implicitly tested in some other tests, but I added an explicit 
test for Exec() RPCs, and add some additional validation for that specific case.


http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@363
PS27, Line 363:   def test_retry_query_hs2(self):
> Fwiw, since beeswax is now deprecated and pretty much all clients use hs2,
Yeah, I thought about this while writing these tests. My goal was to make sure 
that all code paths are covered, which I think this test suite is still 
achieving.

Most of the query retry logic is agnostic to the specific protocol used, so I 
think whether it is beeswax or hs2, most of the time it shouldn't make a big 
difference, because the code paths covered will be the same.

I think this test should cover all the code paths that are unique to HS2.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 27
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Tue, 12 May 2020 23:01:40 +0000
Gerrit-HasComments: Yes

Reply via email to