Michael Ho created IMPALA-8339:
----------------------------------

             Summary: Coordinator should be more resilient to fragment 
instances startup failure
                 Key: IMPALA-8339
                 URL: https://issues.apache.org/jira/browse/IMPALA-8339
             Project: IMPALA
          Issue Type: Improvement
          Components: Distributed Exec
            Reporter: Michael Ho


Impala currently relies on statestore for cluster membership. When an Impala 
executor goes offline, it may take a while for statestore to declare that node 
as unavailable and for that information to be propagated to all coordinator 
nodes. Within this window, some coordinator nodes may still attempt to issue 
RPCs to the faulty node, resulting in RPC failures which resulted in query 
failures. In other words, many queries may fail to start within this window 
until all coordinator nodes get the latest information on cluster membership.

Going forward, coordinator may need to fall back to using backup executors for 
each fragments in case some of the executors are not available. Moreover, 
*coordinator should treat the cluster membership information from statestore 
(or any external source of truth e.g. etcd) as hints instead of ground truth* 
and adjust the scheduling of fragment instances based on the availability of 
the executors from the coordinator's perspective.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to