Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the 
node fails
......................................................................


Patch Set 12:

> Depends on the particular RPC. When we blacklist for a failed
 > Exec() rpc, it will never have been retried. For TransmitData(), it
 > will only be retried if the error that came back is "server too
 > busy"

What about connection / connectivity errors? If the connection is broken, do we 
make any attempt to retry connection establishment, or do we just fail?
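To make sure I'm reading the policy right, here is a rough sketch of the per-RPC retry decision as described above. All names here are illustrative, not the actual Impala code: Exec() failures are never retried, and TransmitData() is retried only on a "server too busy" error.

```cpp
// Hypothetical sketch of the retry policy described above; not Impala's
// actual implementation.
#include <cassert>
#include <string>

enum class RpcKind { kExec, kTransmitData };

// Decide whether a failed RPC should be retried, per the policy quoted
// above. 'error' stands in for whatever error status the RPC returned.
bool ShouldRetryRpc(RpcKind kind, const std::string& error) {
  switch (kind) {
    case RpcKind::kExec:
      return false;  // Exec() is never retried.
    case RpcKind::kTransmitData:
      // Only retried when the receiver reported it was too busy.
      return error == "server too busy";
  }
  return false;
}
```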

 > If auto-scaling takes into account blacklisting, I would probably
 > say that's the wrong approach and we should think about changing it
 > to only consider the statestore view of cluster membership.

Well, if a node is blacklisted, no queries get scheduled on it, right? So 
functionally the node isn't doing any useful work, and it might make sense for 
the auto-scaler to treat it as "dead" and no longer part of the executor group. 
A better approach might be to only consider a node "dead" if it has been 
blacklisted multiple times. This is a larger discussion, though; we haven't 
spent enough time thinking through node blacklisting + auto-scaler 
integration. I think we should, but probably as a separate item.

 > Definitely agree that we need to put some work into realistic fault
 > testing, and not just rely on our intuition about blacklisting/retry
 > policy.

Filed IMPALA-9253 for this.

 > I'm definitely interested in approaches to retries that don't
 > require marking statuses retriable (obviously I've had a lot of
 > concerns with that approach). I could see some problems with this
 > approach too, such as that we may have situations in the future
 > where we want to retry queries without doing any blacklisting (eg.
 > if we think there was stale metadata and replanning the query will
 > let it succeed)

Yeah, it's possible that in the future we might want to retry queries even if a 
node hasn't been blacklisted, but I don't see a strong use case for that 
pattern right now. In the interest of not over-engineering the initial 
solution, I'm fine with making retries blacklist / cluster-membership 
driven.
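To spell out what "blacklist / cluster-membership driven" retries would mean, here is a minimal sketch, with hypothetical names (the real coordinator logic would obviously be more involved):

```cpp
// Hypothetical sketch of a blacklist-driven query retry policy: a failed
// query is only retried when its failure caused a node to be blacklisted,
// i.e. cluster membership changed and replanning may succeed.
struct QueryFailure {
  bool caused_blacklisting;  // did this failure blacklist a node?
  int attempt;               // how many times the query has already run
};

bool ShouldRetryQuery(const QueryFailure& f, int max_attempts) {
  // Retry only when membership changed and attempts remain; no attempt
  // is made to classify individual statuses as retriable.
  return f.caused_blacklisting && f.attempt < max_attempts;
}
```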

 > or probably worse - that you could have a query
 > that fails due to a failed rpc and reports that the node should be
 > blacklisted, and then also hits another error that isn't retry-able
 > but we retry it anyways. Not sure how common a situation like that
 > would be, though.

Definitely possible, but a bit of an edge case. A query would have to fail for 
two separate reasons at approximately the same time: an error that triggers a 
blacklist, and an error that doesn't. Whenever the Coordinator gets an error, 
it cancels the rest of the query, so the non-blacklist error would have to 
occur before the cancellation is processed. It's something we should fix, but I 
think we can defer it out of the initial implementation. Filed IMPALA-9124 as a 
follow-up.

The fix would probably be to ensure that every failed fragment either 
triggered a blacklisting or failed because it was cancelled.
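That invariant could be checked with something like the following before deciding to retry. Again, this is a hypothetical sketch with made-up names, not the actual fix:

```cpp
// Hypothetical sketch of the proposed invariant: retry only if every
// failed fragment instance either triggered a blacklisting or failed
// because the coordinator cancelled it.
#include <vector>

struct FragmentFailure {
  bool triggered_blacklist;
  bool was_cancelled;
};

bool SafeToRetry(const std::vector<FragmentFailure>& failures) {
  for (const auto& f : failures) {
    // A failure that is neither blacklist-triggering nor a cancellation
    // is a genuine, non-retriable error: don't retry the query.
    if (!f.triggered_blacklist && !f.was_cancelled) return false;
  }
  return true;
}
```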


--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Mon, 16 Dec 2019 21:30:07 +0000
Gerrit-HasComments: No
