Sahil Takiar has posted comments on this change. (
http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the
node fails
......................................................................
Patch Set 12:
Responding to comments from @Thomas.
> Then there's blacklisting, which only takes a single failed rpc to
> get a node blacklisted
Right, but that RPC has already been retried multiple times, right? I should
probably validate that the RPCs are actually retried for these specific error
codes.
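To make the point concrete, here's a rough sketch (hypothetical names, not
Impala's actual RPC layer) of the kind of retry loop that typically wraps
these RPCs, so a failure only surfaces to the sender after several attempts:

  #include <cerrno>
  #include <chrono>
  #include <functional>
  #include <thread>

  // Stand-in for impala::Status; only the posix code matters here.
  struct Status {
    int posix_code = 0;  // 0 == OK
    bool ok() const { return posix_code == 0; }
  };

  // Illustrative predicate: which transport-level errors to retry.
  bool IsRetryableRpcError(const Status& s) {
    return s.posix_code == ECONNREFUSED || s.posix_code == ECONNRESET;
  }

  Status DoRpcWithRetry(const std::function<Status()>& rpc, int max_attempts,
      std::chrono::milliseconds backoff) {
    Status status;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
      status = rpc();
      if (status.ok() || !IsRetryableRpcError(status)) return status;
      std::this_thread::sleep_for(backoff);  // back off before retrying
    }
    return status;  // retries exhausted; only now does the sender see a failure
  }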
> So, I don't think that we need to be overly afraid about
> blacklisting things aggressively, since if the node actually isn't
> unhealthy blacklisting it once isn't going to have that big of an
> impact on the cluster.
One caveat here is with executor groups. IIUC there is a concept of "executor
group health". If enough nodes in an executor group are either blacklisted /
removed via the statestore, Impala will mark that executor group as unhealthy,
and then stop scheduling any queries on it.
Stolen from the commit message for IMPALA-8484:
Executors can be configured with an executor group through the newly
added '--executor_groups' flag. [...] Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the groups
will it be considered for admission.
So depending on an external auto-scaling policy, it's possible that blacklisting
a node (even for 12 seconds) makes the executor group unhealthy and causes the
auto-scaler to create a new executor group (a potentially expensive operation).
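For reference, here's a minimal sketch of how such a spec splits into a group
name and minimum size, and what "healthy" means for admission (the parsing
details are illustrative, not the real implementation from IMPALA-8484):

  #include <string>
  #include <utility>

  // "default-pool-1:3" -> {"default-pool-1", 3}; a spec without a ':' is
  // assumed here to default to a minimum size of 1.
  std::pair<std::string, int> ParseExecutorGroup(const std::string& spec) {
    size_t colon = spec.rfind(':');
    if (colon == std::string::npos) return {spec, 1};
    return {spec.substr(0, colon), std::stoi(spec.substr(colon + 1))};
  }

  // A group is only considered for admission while it has at least min_size
  // usable (registered, non-blacklisted) executors, which is why
  // blacklisting a single node can tip a group into "unhealthy".
  bool IsGroupHealthy(int num_usable_executors, int min_size) {
    return num_usable_executors >= min_size;
  }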
> That's particularly the case since this work is aimed at
> facilitating transparent query retry. We definitely want to avoid
> scenarios where a TransmitData rpc fails, we don't blacklist the
> node because the failure doesn't match our whitelist of posix codes
> but we go ahead and retry the query anyways and then it fails again
> because we schedule it on the same bad node again.
Agree. I think the first milestone for transparent query retries is to just
handle the scenario where nodes crash while running a query. Based on some
experiments I did, these were the only error codes that came up. I'm not
confident enough in my understanding of Linux error codes to include any more
than the current three, but am open to suggestions.
In general, I think we can be pretty test-driven here. The best way to figure
out which additional error codes to add would be to run some network fault
injection tests and see which error codes Impala fails with. I'll open a
follow-up JIRA for that.
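For illustration, the whitelist check amounts to something like the following;
the specific codes below are stand-ins for the current three, and the fault
injection tests would confirm the exact set:

  #include <cerrno>

  // Only errors we're confident indicate a crashed or unreachable node
  // should trigger blacklisting; anything else fails the query as before.
  bool ErrorIndicatesCrashedNode(int posix_code) {
    switch (posix_code) {
      case ECONNREFUSED:  // daemon no longer accepting connections
      case ECONNRESET:    // connection torn down mid-stream
      case ESHUTDOWN:     // socket already shut down
        return true;
      default:
        return false;
    }
  }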
> Of course, for this patch and the followup patch, we can just only
> mark the status as retryable if we do in fact decide to blacklist
> the node by doing the check for posix codes in KrpcDataStreamSender
I generally wanted to avoid doing this. I think it would be better for only the
Coordinator to make the ultimate decision about whether a node is blacklisted.
KrpcDataStreamSender just provides enough information for the Coordinator to
make the blacklisting decision. Otherwise the logic for determining if a node
should be blacklisted will be spread across multiple classes, making it hard to
reason about blacklisting decisions.
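Roughly, the division of responsibility I have in mind looks like this (names
hypothetical):

  #include <string>

  // KrpcDataStreamSender side: no policy, just facts about the failure.
  struct FailedRpcInfo {
    std::string dest_address;  // the backend the RPC failed against
    int posix_code = 0;        // transport-level error code, if any
  };

  // Coordinator side: the single place where blacklisting policy lives.
  bool ShouldBlacklist(const FailedRpcInfo& info) {
    return ErrorIndicatesCrashedNode(info.posix_code);  // see sketch above
  }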
However, I think you do bring up a good point. We don't want a failed RPC to
trigger a query retry unless we are actually blacklisting the target node.
After thinking it through some more, maybe marking queries as "retryable" isn't
necessary at all. Instead, whenever a query fails and blacklists a node, it is
retried, so query retries are blacklisting-driven (sketched below). This would
mean abandoning the patch for IMPALA-9138 (Classify certain errors as
retryable), which I'm fine with doing. Let me know if you feel differently.
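In other words, the control flow would be something like this (again with
hypothetical names):

  // Hypothetical Coordinator hooks.
  void BlacklistNode(const std::string& address);
  void RetryQuery();
  void FailQuery();

  // A query is only retried when its failure actually blacklisted a node,
  // so a retry never lands on the same bad node again.
  void OnQueryFailure(const FailedRpcInfo& info, int num_retries_left) {
    if (ShouldBlacklist(info)) {
      BlacklistNode(info.dest_address);
      if (num_retries_left > 0) RetryQuery();
      return;
    }
    FailQuery();  // non-blacklisting failure: surface the error as before
  }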
> but there's still the problem that this patch will only blacklist
> one node per query.
Yeah, maybe this is too conservative. Assuming transparent query retries
support "x" retries, Impala should be able to tolerate "x" faults across the
duration of the query (and its retries); e.g. with x = 2, a query should
survive two distinct crashed nodes, one per retry. Maybe that is okay, maybe
not. I don't have strong feelings here.
> To be clear - I think for the first iteration of all of this it's
> probably fine to be fairly conservative about what we blacklist,
> and if sometimes queries fail when we could have made them succeed
> by being more aggressive that's okay (so long as they would've
> failed even without the blacklisting work, such that it's not a
> regression).
Agree. It feels safer to add functionality to blacklisting incrementally.
> Just wanted to make the points that 1) blacklisting
> was designed in a way where it's not that big of a deal and we don't
> need to worry too much about screwing up a cluster by blacklisting
> a node unnecessarily and 2) we probably really want to avoid the
> situation where we retry a query and it fails again for the same
> reason
Agree with both points.
--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Fri, 13 Dec 2019 21:02:30 +0000
Gerrit-HasComments: No