[
https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Tauber-Marshall resolved IMPALA-8339.
--------------------------------------------
Resolution: Fixed
Fix Version/s: Impala 3.3.0
> Coordinator should be more resilient to fragment instances startup failure
> --------------------------------------------------------------------------
>
> Key: IMPALA-8339
> URL: https://issues.apache.org/jira/browse/IMPALA-8339
> Project: IMPALA
> Issue Type: Improvement
> Components: Distributed Exec
> Reporter: Michael Ho
> Assignee: Thomas Tauber-Marshall
> Priority: Critical
> Labels: Availability, resilience
> Fix For: Impala 3.3.0
>
>
> Impala currently relies on statestore for cluster membership. When an Impala
> executor goes offline, it may take a while for statestore to declare that
> node as unavailable and for that information to be propagated to all
> coordinator nodes. Within this window, some coordinator nodes may still
> attempt to issue RPCs to the faulty node, resulting in RPC failures which
> resulted in query failures. In other words, many queries may fail to start
> within this window until all coordinator nodes get the latest information on
> cluster membership.
> Going forward, coordinator may need to fall back to using backup executors
> for each fragments in case some of the executors are not available. Moreover,
> *coordinator should treat the cluster membership information from statestore
> (or any external source of truth e.g. etcd) as hints instead of ground truth*
> and adjust the scheduling of fragment instances based on the availability of
> the executors from the coordinator's perspective.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)