[ 
https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832701#comment-16832701
 ] 

Sahil Takiar commented on IMPALA-8339:
--------------------------------------

If we want to go with a blacklisting approach, Spark built a similar feature 
that might be worth looking at: 
https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/ (although 
things are more complex in the Spark world because of task retries).

Blacklisting is also interesting in the context of query retries; e.g. if a 
query fails due to a bad disk, the failed fragments should probably be retried 
on a different set of nodes.

> Coordinator should be more resilient to fragment instances startup failure
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-8339
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8339
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>            Reporter: Michael Ho
>            Priority: Major
>              Labels: Availability, resilience
>
> Impala currently relies on statestore for cluster membership. When an Impala 
> executor goes offline, it may take a while for statestore to declare that 
> node as unavailable and for that information to be propagated to all 
> coordinator nodes. Within this window, some coordinator nodes may still 
> attempt to issue RPCs to the faulty node, resulting in RPC failures which 
> resulted in query failures. In other words, many queries may fail to start 
> within this window until all coordinator nodes get the latest information on 
> cluster membership.
> Going forward, coordinator may need to fall back to using backup executors 
> for each fragments in case some of the executors are not available. Moreover, 
> *coordinator should treat the cluster membership information from statestore 
> (or any external source of truth e.g. etcd) as hints instead of ground truth* 
> and adjust the scheduling of fragment instances based on the availability of 
> the executors from the coordinator's perspective.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to