Tim Armstrong created IMPALA-7872:
-------------------------------------

             Summary: Extended health checks to mark node as down
                 Key: IMPALA-7872
                 URL: https://issues.apache.org/jira/browse/IMPALA-7872
             Project: IMPALA
          Issue Type: Improvement
          Components: Distributed Exec
            Reporter: Tim Armstrong


This is an umbrella JIRA to improve handling of complex failure modes other
than fail-stop. The current statestore heartbeat mechanism assumes that an
Impala daemon that responds to heartbeats is healthy and can have query
fragments scheduled on it. Memory-based admission control adds some
robustness here by not admitting queries on daemons where memory would be
oversubscribed.

Examples of failure modes of interest are:
* Hangs, where a particular node can't make progress on some or all queries
(the JVM hangs in IMPALA-7483 are a good example).
* Repeated fragment instance startup failures, e.g. where coordinators can't
successfully start fragments on an Impala daemon because of communication
errors or other issues.

We can't automatically handle all failure modes, but we could improve handling
of some common ones, particularly repeated fragment startup failures or hangs.
The goal would be to degrade more gracefully so that repeated failures don't
cause a cluster-wide outage. The goal isn't to prevent all failures, just to
recover to a healthy state automatically in more scenarios.
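
As a rough illustration of what handling repeated fragment startup failures
could look like, here is a minimal sketch of a coordinator-side tracker. It is
not Impala code: the class name, method names, thresholds and cooldown are all
hypothetical, and it assumes the coordinator can observe per-node startup
successes and failures. A node that fails too often within a sliding window is
excluded from scheduling for a while and then readmitted automatically.

```python
import time
from collections import defaultdict, deque


class NodeHealthTracker:
    """Hypothetical tracker for illustration only; not part of Impala."""

    def __init__(self, max_failures=3, window_s=60.0, cooldown_s=300.0,
                 clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated inside the window
        self.window_s = window_s          # sliding window length in seconds
        self.cooldown_s = cooldown_s      # how long an excluded node stays out
        self.clock = clock                # injectable so tests can fake time
        self._failures = defaultdict(deque)  # node -> recent failure timestamps
        self._excluded_until = {}            # node -> time it may be used again

    def record_startup_failure(self, node):
        now = self.clock()
        recent = self._failures[node]
        recent.append(now)
        # Drop failures that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_failures:
            # Too many recent failures: stop scheduling on this node for a while.
            self._excluded_until[node] = now + self.cooldown_s
            recent.clear()

    def record_startup_success(self, node):
        # A successful start resets the failure history for the node.
        self._failures.pop(node, None)

    def is_schedulable(self, node):
        until = self._excluded_until.get(node)
        if until is None:
            return True
        if self.clock() >= until:
            # Cooldown elapsed: readmit the node and let it prove itself again.
            del self._excluded_until[node]
            return True
        return False
```

A scheduler would consult is_schedulable() before placing fragments on a node.
The time-limited exclusion is one possible design choice among several: it
lets the cluster recover without operator action if the problem was transient,
at the cost of retrying a node that may still be unhealthy.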

IMPALA-1760 (graceful shutdown) may give us some better options here, since if 
a node notices that it is somehow unhealthy, it could gracefully remove itself 
from scheduling and restart itself.



