[
https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832701#comment-16832701
]
Sahil Takiar commented on IMPALA-8339:
--------------------------------------
If we want to go with a blacklisting approach, Spark built a similar feature
that might be worth looking at:
https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/ (although
things are more complex in the Spark world because of task retries).
Blacklisting is also interesting in the context of query retries; e.g. if a
query fails due to a bad disk, the failed fragments should probably be retried
on a different set of nodes.
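
To make the idea concrete, here is a rough sketch of what a coordinator-local blacklist with expiring entries might look like. This is not Impala's or Spark's implementation; the class name, the 30-second default timeout, and the injectable clock are all assumptions for illustration.

```python
import time


class ExecutorBlacklist:
    """Hypothetical coordinator-local record of executors that recently failed."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self._timeout_s = timeout_s  # assumed value; how long an entry lasts
        self._clock = clock          # injectable so the expiry logic is testable
        self._expiry = {}            # executor address -> time the entry expires

    def report_failure(self, executor):
        """Blacklist an executor (or extend its entry) after a failed RPC."""
        self._expiry[executor] = self._clock() + self._timeout_s

    def is_blacklisted(self, executor):
        expiry = self._expiry.get(executor)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Entry has aged out; give the executor another chance.
            del self._expiry[executor]
            return False
        return True

    def filter(self, members):
        """Return the members that are not currently blacklisted, in order."""
        return [m for m in members if not self.is_blacklisted(m)]
```

A retried query would schedule its fragments only on `filter(members)`, so a node with a bad disk is skipped until its entry expires.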
> Coordinator should be more resilient to fragment instances startup failure
> --------------------------------------------------------------------------
>
> Key: IMPALA-8339
> URL: https://issues.apache.org/jira/browse/IMPALA-8339
> Project: IMPALA
> Issue Type: Improvement
> Components: Distributed Exec
> Reporter: Michael Ho
> Priority: Major
> Labels: Availability, resilience
>
> Impala currently relies on statestore for cluster membership. When an Impala
> executor goes offline, it may take a while for statestore to declare that
> node as unavailable and for that information to be propagated to all
> coordinator nodes. Within this window, some coordinator nodes may still
> attempt to issue RPCs to the faulty node, resulting in RPC failures which
> in turn cause query failures. In other words, many queries may fail to start
> within this window until all coordinator nodes get the latest information on
> cluster membership.
> Going forward, the coordinator may need to fall back to using backup executors
> for each fragment in case some of the executors are not available. Moreover,
> *coordinator should treat the cluster membership information from statestore
> (or any external source of truth e.g. etcd) as hints instead of ground truth*
> and adjust the scheduling of fragment instances based on the availability of
> the executors from the coordinator's perspective.
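
The fallback described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not Impala code: `send_rpc` is a hypothetical callable that raises `ConnectionError` when the executor cannot be reached, and `unavailable` is a hypothetical coordinator-local set of executors observed to be down.

```python
def start_fragment_instance(candidates, send_rpc, unavailable):
    """Try each candidate executor in order and return the one that accepted.

    candidates:  executors taken from statestore membership, preferred first.
                 Treated as a hint only; any of them may already be offline.
    send_rpc:    hypothetical callable; raises ConnectionError on failure.
    unavailable: coordinator-local set of executors observed to be down.
    """
    for executor in candidates:
        if executor in unavailable:
            # The coordinator's own observation overrides the membership hint.
            continue
        try:
            send_rpc(executor)
            return executor
        except ConnectionError:
            # Remember the failure so later fragments skip this executor.
            unavailable.add(executor)
    raise RuntimeError("no available executor for fragment instance")
```

The point of the design is that a failed RPC updates the coordinator's own view immediately, rather than waiting for the statestore to propagate the membership change.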