Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the 
node fails
......................................................................


Patch Set 12:

> Right, but that RPC has already been retried multiple times, right?
 > I should probably validate that for these specific error codes, the
 > RPCs are actually retried.

Depends on the particular RPC. When we blacklist for a failed Exec() rpc, it 
will never have been retried. For TransmitData(), it will only be retried if 
the error that came back is "server too busy"

 > So depending on an external auto-scaling policy, its possible that
 > blacklisting a node (even for 12 seconds) makes the executor group
 > unhealthy, and causes the auto-scaler to create a new executor
 > group (a potentially expensive operation).

If auto-scaling takes into account blacklisting, I would probably say that's 
the wrong approach and we should think about changing it to only consider the 
statestore view of cluster membership.

 > In general, I think we can be pretty test driven here. The best way
 > to figure out what additional errors codes to add would be to run
 > some network fault injection tests and see what error codes Impala
 > fails with. I'll open a follow up JIRA for that.

Definitely agree that we need to put some work into realistic fault testing, 
and not just rely on our intuition about blacklisting/retry policy.

 > I generally wanted to avoid doing this. I think it would be better
 > for only the Coordinator to make the ultimate decision about
 > whether a node is blacklisted. KrpcDataStreamSender just provides
 > enough information for the Coordinator to make the blacklisting
 > decision. Otherwise the logic for determining if a node should be
 > blacklisted will be spread across multiple classes, making it hard
 > to reason about blacklisting decisions.

I don't think it would be so bad to add a utility function like 
IsRetriableRpcError(kudu::Status) or whatever to keep the logic consistent, but 
I don't feel strongly about how exactly we arrange this, as long as we're 
getting the correct interaction between blacklisting and retrying.

 > However, I think you do bring up a good. We don't want a failed RPC
 > to trigger a query retry unless we are actually blacklisting the
 > target node. After thinking through it some more, maybe marking
 > queries as "retryable" isn't necessary at all. Instead, whenever a
 > query fails + blacklists a node, it is retried. So query retries
 > are blacklisting driven. This would mean abandoning the patch for
 > IMPALA-9138 (Classify certain errors as retryable), which I'm fine
 > with doing. Let me know if you feel differently.

I'm definitely interested in approaches to retries that don't require marking 
statuses retriable (obviously I've had a lot of concerns with that approach). I 
could see some problems with this approach too, such as that we may have 
situations in the future where we want to retry queries without doing any 
blacklisting (eg. if we think there was stale metadata and replanning the query 
will let it succeed), or probably worse - that you could have a query that 
fails due to a failed rpc and reports that the node should be blacklisted, and 
then also hits another error that isn't retry-able but we retry it anyways. Not 
sure how common a situation like that would be, though.

 > > but there's still the problem that this patch will only blacklist
 > > one node per query.
 >
 > Yeah, maybe this is too conservative. Assuming transparent query
 > retries support "x" number of retries, Impala should be able to
 > tolerate "x" faults across the duration of the query (and its
 > retries). Maybe that is okay, maybe not. Don't have strong feelings
 > here.

I think its probably fine to leave as is for now, until we have more confidence 
that blacklisting is really working how we want it to.


--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Mon, 16 Dec 2019 18:37:11 +0000
Gerrit-HasComments: No

Reply via email to