Sahil Takiar has posted comments on this change. (
http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the
node fails
......................................................................
Patch Set 12:
Responding to comments from @Thomas.
> Then there's blacklisting, which only takes a single failed rpc to
> get a node blacklisted
Right, but that RPC has already been retried multiple times, right? I should
probably validate that the RPCs are actually retried for these specific error
codes.
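To make the point concrete, here's a rough sketch (hypothetical names, not
Impala's actual RPC layer) of the kind of retry loop that typically wraps
these RPCs, so a failure only surfaces to the sender after several attempts:

  #include <cerrno>
  #include <chrono>
  #include <functional>
  #include <thread>

  // Stand-in for impala::Status; only the posix code matters here.
  struct Status {
    int posix_code = 0;  // 0 == OK
    bool ok() const { return posix_code == 0; }
  };

  // Illustrative predicate: which transport-level errors to retry.
  bool IsRetryableRpcError(const Status& s) {
    return s.posix_code == ECONNREFUSED || s.posix_code == ECONNRESET;
  }

  Status DoRpcWithRetry(const std::function<Status()>& rpc, int max_attempts,
      std::chrono::milliseconds backoff) {
    Status status;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
      status = rpc();
      if (status.ok() || !IsRetryableRpcError(status)) return status;
      std::this_thread::sleep_for(backoff);  // back off before retrying
    }
    return status;  // retries exhausted; only now does the sender see a failure
  }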
> So, I don't think that we need to be overly afraid about
> blacklisting things aggressively, since if the node actually isn't
> unhealthy blacklisting it once isn't going to have that big of an
> impact on the cluster.
One caveat here is with executor groups. IIUC there is a concept of "executor
group health". If enough nodes in an executor group are either blacklisted /
removed via the statestore, Impala will mark that executor group as unhealthy,
and then stop scheduling any queries on it.
Stolen from the commit message for IMPALA-8484:
Executors can be configured with an executor group through the newly
added '--executor_groups' flag. [...] Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the groups
will it be considered for admission.
So depending on an external auto-scaling policy, it's possible that blacklisting
a node (even for 12 seconds) makes the executor group unhealthy and causes the
auto-scaler to create a new executor group (a potentially expensive operation).
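For reference, here's a minimal sketch of how such a spec splits into a group
name and minimum size, and what "healthy" means for admission (the parsing
details are illustrative, not the real implementation from IMPALA-8484):

  #include <string>
  #include <utility>

  // "default-pool-1:3" -> {"default-pool-1", 3}; a spec without a ':' is
  // assumed here to default to a minimum size of 1.
  std::pair<std::string, int> ParseExecutorGroup(const std::string& spec) {
    size_t colon = spec.rfind(':');
    if (colon == std::string::npos) return {spec, 1};
    return {spec.substr(0, colon), std::stoi(spec.substr(colon + 1))};
  }

  // A group is only considered for admission while it has at least min_size
  // usable (registered, non-blacklisted) executors, which is why
  // blacklisting a single node can tip a group into "unhealthy".
  bool IsGroupHealthy(int num_usable_executors, int min_size) {
    return num_usable_executors >= min_size;
  }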
> That's particularly the case since this work is aimed at
> facilitating transparent query retry. We definitely want to avoid
> scenarios where a TransmitData rpc fails, we don't blacklist the
> node because the failure doesn't match our whitelist of posix codes
> but we go ahead and retry the query anyways and then it fails again
> because we schedule it on the same bad node again.
Agree. I think the first milestone for transparent query retries is to just
handle the scenario where nodes crash while running a query. Based on some
experiments I did, these were the only error codes that came up. I'm not
confident enough in my understanding of Linux error codes to include any more
than the current three, but am open to suggestions.
In general, I think we can be pretty test-driven here. The best way to figure
out which additional error codes to add would be to run some network fault
injection tests and see which error codes Impala fails with. I'll open a
follow-up JIRA for that.
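For illustration, the whitelist check amounts to something like the following;
the specific codes below are stand-ins for the current three, and the fault
injection tests would confirm the exact set:

  #include <cerrno>

  // Only errors we're confident indicate a crashed or unreachable node
  // should trigger blacklisting; anything else fails the query as before.
  bool ErrorIndicatesCrashedNode(int posix_code) {
    switch (posix_code) {
      case ECONNREFUSED:  // daemon no longer accepting connections
      case ECONNRESET:    // connection torn down mid-stream
      case ESHUTDOWN:     // socket already shut down
        return true;
      default:
        return false;
    }
  }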
> Of course, for this patch and the followup patch, we can just only
> mark the status as retryable if we do in fact decide to blacklist
> the node by doing the check for posix codes in KrpcDataStreamSender
I generally wanted to avoid doing this. I think it would be better for only the
Coordinator to make the ultimate decision about whether a node is blacklisted.
KrpcDataStreamSender just provides enough information for the Coordinator to
make the blacklisting decision. Otherwise the logic for determining if a node
should be blacklisted will be spread across multiple classes, making it hard to
reason about blacklisting decisions.
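Roughly, the division of responsibility I have in mind looks like this (names
hypothetical):

  #include <string>

  // KrpcDataStreamSender side: no policy, just facts about the failure.
  struct FailedRpcInfo {
    std::string dest_address;  // the backend the RPC failed against
    int posix_code = 0;        // transport-level error code, if any
  };

  // Coordinator side: the single place where blacklisting policy lives.
  bool ShouldBlacklist(const FailedRpcInfo& info) {
    return ErrorIndicatesCrashedNode(info.posix_code);  // see sketch above
  }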
However, I think you do bring up a good point. We don't want a failed RPC to
trigger a query retry unless we are actually blacklisting the target node.
After thinking it through some more, maybe marking queries as "retryable" isn't
necessary at all. Instead, whenever a query fails and blacklists a node, it is
retried, so query retries are blacklisting-driven (sketched below). This would
mean abandoning the patch for IMPALA-9138 (Classify certain errors as
retryable), which I'm fine with doing. Let me know if you feel differently.
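In other words, the control flow would be something like this (again with
hypothetical names):

  // Hypothetical Coordinator hooks.
  void BlacklistNode(const std::string& address);
  void RetryQuery();
  void FailQuery();

  // A query is only retried when its failure actually blacklisted a node,
  // so a retry never lands on the same bad node again.
  void OnQueryFailure(const FailedRpcInfo& info, int num_retries_left) {
    if (ShouldBlacklist(info)) {
      BlacklistNode(info.dest_address);
      if (num_retries_left > 0) RetryQuery();
      return;
    }
    FailQuery();  // non-blacklisting failure: surface the error as before
  }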
> but there's still the problem that this patch will only blacklist
> one node per query.
Yeah, maybe this is too conservative. Assuming transparent query retries
support "x" retries, Impala should be able to tolerate "x" faults across the
duration of the query (and its retries); e.g. with x = 2, a query should
survive two distinct crashed nodes, one per retry. Maybe that is okay, maybe
not. I don't have strong feelings here.
> To be clear - I think for the first iteration of all of this it's
> probably fine to be fairly conservative about what we blacklist,
> and if sometimes queries fail when we could have made them succeed
> by being more aggressive that's okay (so long as they would've
> failed even without the blacklisting work, such that it's not a
> regression).
Agree. It feels safer to add functionality to blacklisting incrementally.
> Just wanted to make the points that 1) blacklisting
> was designed in a way where it's not that big of a deal and we don't
> need to worry too much about screwing up a cluster by blacklisting
> a node unnecessarily and 2) we probably really want to avoid the
> situation where we retry a query and it fails again for the same
> reason
Agree with both points.
--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Fri, 13 Dec 2019 21:02:30 +0000
Gerrit-HasComments: No