Jim Challenger created UIMA-2772:
------------------------------------
Summary: DUCC resource manager - Restart and fast-start
Key: UIMA-2772
URL: https://issues.apache.org/jira/browse/UIMA-2772
Project: UIMA
Issue Type: Bug
Components: DUCC
Reporter: Jim Challenger
Assignee: Jim Challenger
Currently RM waits a "reasonable time" (init-stability) on startup to allow
nodes to check in, before accepting scheduling requests. It is not possible to
know exactly how long to wait, making init-stability a heuristic. For normal
startup this is not a problem. However, if RM is restarting 'hot', or if the
orchestrator publishes non-preemptable jobs on restart, and the necessary nodes
have not arrived by the end of the init-stability wait, the result can be
over-commitment, under-commitment, and in some cases inconsistent state (and
crashes).
To remedy this, RM will include the full Node object in its publications to the
orchestrator (OR), which will echo the objects back for work that it believes to
be active. On startup, RM can fully reconstruct its state as of its last
publication from the echoed Node objects, eliminating the problem. As a side
effect, RM need not wait for nodes to check in, which significantly decreases
its startup time. If nodes added to the resource pool in this way never check
in, the normal "dead node" mechanism will kick in, maintaining consistency.
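The proposal can be illustrated with a rough sketch. This is not DUCC code:
NodeInfo, Allocation, ResourcePool, and all of their methods and fields are
hypothetical names chosen for illustration, and memory in GB stands in for
whatever RM actually schedules. The sketch assumes the OR echoes back one
record per active piece of work, each carrying the full node description, and
that the "dead node" check is a simple timeout since the node was last seen.

```java
import java.util.*;

// All types and names below are hypothetical, not DUCC's actual classes.

/** Node description as RM might publish it and the OR might echo it back. */
final class NodeInfo {
    final String name;
    final int memoryGb;
    NodeInfo(String name, int memoryGb) { this.name = name; this.memoryGb = memoryGb; }
}

/** One piece of active work echoed back by the OR, carrying its full node. */
final class Allocation {
    final String jobId;
    final NodeInfo node;
    final int memoryGb;
    Allocation(String jobId, NodeInfo node, int memoryGb) {
        this.jobId = jobId; this.node = node; this.memoryGb = memoryGb;
    }
}

final class ResourcePool {
    private final Map<String, NodeInfo> nodes = new TreeMap<>();
    private final Map<String, Integer> committedGb = new HashMap<>();
    private final Map<String, Long> lastSeenMs = new HashMap<>();

    /** Rebuild the pool from echoed work without waiting for node check-ins. */
    void restore(List<Allocation> echoed, long nowMs) {
        for (Allocation a : echoed) {
            nodes.putIfAbsent(a.node.name, a.node);
            committedGb.merge(a.node.name, a.memoryGb, Integer::sum);
            // Start the dead-node clock at restore time for restored nodes.
            lastSeenMs.putIfAbsent(a.node.name, nowMs);
        }
    }

    /** A node checks in: record it (or refresh it) and note the time. */
    void heartbeat(NodeInfo node, long nowMs) {
        nodes.put(node.name, node);
        lastSeenMs.put(node.name, nowMs);
    }

    /** Uncommitted memory on a node; 0 if the node is unknown. */
    int freeGb(String nodeName) {
        NodeInfo n = nodes.get(nodeName);
        if (n == null) return 0;
        return n.memoryGb - committedGb.getOrDefault(nodeName, 0);
    }

    /** Drop nodes not seen within timeoutMs and return their names. */
    List<String> purgeDead(long nowMs, long timeoutMs) {
        List<String> dead = new ArrayList<>();
        for (String name : nodes.keySet()) {
            if (nowMs - lastSeenMs.get(name) > timeoutMs) dead.add(name);
        }
        for (String name : dead) {
            nodes.remove(name);
            committedGb.remove(name);
            lastSeenMs.remove(name);
        }
        return dead;
    }
}
```

In this sketch a restored node is schedulable immediately, with its existing
commitments already counted, and a restored node that never sends a heartbeat
is dropped by the same timeout that applies to any other node.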