[jira] [Commented] (IMPALA-8339) Coordinator should be more resilient to fragment instances startup failure

ASF subversion and git services (JIRA) Tue, 30 Jul 2019 08:23:27 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896223#comment-16896223
 ]


ASF subversion and git services commented on IMPALA-8339:
---------------------------------------------------------

Commit dfc968dff1eabc71871e2d941fbe9539944c3e88 in impala's branch 
refs/heads/master from Thomas Tauber-Marshall
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dfc968d ]

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an executor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, e.g. if an executor appears to be
a straggler.
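
The behaviour described above can be illustrated with a minimal sketch.
This is not the code from the patch (the Impala backend is C++); the
class and method names below are hypothetical, and the sketch does not
model how or when an executor leaves the blacklist.

```python
class ExecutorBlacklist:
    """Hypothetical coordinator-local blacklist; never sent to the statestore."""

    def __init__(self):
        self._blacklisted = set()

    def on_exec_rpc_result(self, host, ok):
        # Per the commit message, only an error status from the Exec() RPC
        # causes blacklisting in the first patch.
        if not ok:
            self._blacklisted.add(host)

    def schedulable(self, membership):
        # 'membership' is the executor list learned from the statestore.
        # Blacklisted hosts are dropped from scheduling consideration.
        return [h for h in membership if h not in self._blacklisted]

    def blacklisted(self):
        # Sorted so that callers get a stable ordering.
        return sorted(self._blacklisted)
```

A coordinator using such a structure would filter the statestore
membership through schedulable() before assigning fragment instances,
so a failed executor is skipped before the statestore marks it as down.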

When a query is scheduled and there are currently some blacklisted
executors, a new line 'Blacklisted Executors:' is added to the profile
listing the hostnames of all such executors.
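
As a rough illustration of the profile line, the sketch below formats
it from a set of hostnames. Only the 'Blacklisted Executors:' label
comes from the commit message; the function name, the comma separator
and the sorted ordering are assumptions.

```python
def blacklist_profile_line(blacklisted_hosts):
    """Return the profile line, or None when nothing is blacklisted."""
    if not blacklisted_hosts:
        # The line is only added when some executors are blacklisted.
        return None
    # Separator and ordering are assumptions made for this sketch.
    return "Blacklisted Executors: " + ", ".join(sorted(blacklisted_hosts))
```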

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
- Manual randomized testing locally with iptables.
TODO:
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Reviewed-on: http://gerrit.cloudera.org:8080/13868
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Coordinator should be more resilient to fragment instances startup failure
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-8339
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8339
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>            Reporter: Michael Ho
>            Assignee: Thomas Tauber-Marshall
>            Priority: Critical
>              Labels: Availability, resilience
>
> Impala currently relies on statestore for cluster membership. When an Impala 
> executor goes offline, it may take a while for statestore to declare that 
> node as unavailable and for that information to be propagated to all 
> coordinator nodes. Within this window, some coordinator nodes may still 
> attempt to issue RPCs to the faulty node, resulting in RPC failures which 
> in turn result in query failures. In other words, many queries may fail to start 
> within this window until all coordinator nodes get the latest information on 
> cluster membership.
> Going forward, coordinator may need to fall back to using backup executors 
> for each fragment in case some of the executors are not available. Moreover, 
> *coordinator should treat the cluster membership information from statestore 
> (or any external source of truth e.g. etcd) as hints instead of ground truth* 
> and adjust the scheduling of fragment instances based on the availability of 
> the executors from the coordinator's perspective.
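
The fallback idea in the description can be sketched as follows. This
is only an illustration under stated assumptions: start_fragment and
send_exec_rpc are hypothetical names, and a ConnectionError stands in
for whatever RPC failure the coordinator actually observes.

```python
def start_fragment(candidates, send_exec_rpc):
    """Try candidate executors in order; return (chosen_host, failed_hosts).

    'candidates' is the membership hint from the statestore (primary
    executor first, backups after). 'send_exec_rpc' is a hypothetical
    callable that raises ConnectionError when the host is unreachable.
    """
    failed = []
    for host in candidates:
        try:
            send_exec_rpc(host)
            return host, failed
        except ConnectionError:
            # Treat the statestore entry as a stale hint and move on.
            failed.append(host)
    raise RuntimeError("no reachable executor among: " + ", ".join(candidates))
```

The hosts returned in failed_hosts are the ones the coordinator has
directly observed to be unavailable, independent of what the
statestore currently reports.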



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
