Jim Challenger created UIMA-2772:
------------------------------------

             Summary: DUCC resource manager - Restart and fast-start
                 Key: UIMA-2772
                 URL: https://issues.apache.org/jira/browse/UIMA-2772
             Project: UIMA
          Issue Type: Bug
          Components: DUCC
            Reporter: Jim Challenger
            Assignee: Jim Challenger


Currently RM waits a "reasonable time" (init-stability) on startup to allow 
nodes to check in before accepting scheduling requests.  It is not possible to 
know exactly how long to wait, so init-stability is a heuristic.  For normal 
startup this is not a problem.  However, if RM is restarting 'hot', or if the 
orchestrator (OR) publishes non-preemptable jobs on restart, and the necessary 
nodes have not arrived by the end of the init-stability wait, several problems 
can result: over-commitment, under-commitment, and in some cases inconsistent 
state (and crashes).

To remedy this, RM will include the full Node object in its publications to the 
OR, which will echo the Node objects back for work that it believes to be 
active. On startup RM can then fully reconstruct its state as of its last 
publication from this echo, eliminating the problem. A side effect is that RM 
need not wait for nodes to check in, which significantly decreases its startup 
time.  If nodes added to the resource pool in this way never check in, the 
normal "dead node" mechanism will kick in, maintaining consistency.
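
The recovery idea can be illustrated with a minimal sketch. It is written in 
Python for brevity and is not DUCC code: the names Node, ResourcePool, recover, 
check_in and purge_dead, the shape of the echoed data (job id mapped to a list 
of nodes), and the timeout-based "dead node" check are all assumptions made for 
illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for the full Node object that RM publishes.
    name: str
    memory_gb: int


@dataclass
class ResourcePool:
    # Hypothetical model of RM state; not the real RM data structures.
    nodes: dict = field(default_factory=dict)        # node name -> Node
    allocations: dict = field(default_factory=dict)  # job id -> set of node names
    last_heard: dict = field(default_factory=dict)   # node name -> timestamp

    def recover(self, echoed, now):
        """Rebuild state from the OR's echo, assumed to be {job_id: [Node, ...]}."""
        for job_id, job_nodes in echoed.items():
            for node in job_nodes:
                self.nodes[node.name] = node
                # Start the dead-node clock at recovery time, so a node that
                # never checks in is eventually purged.
                self.last_heard.setdefault(node.name, now)
                self.allocations.setdefault(job_id, set()).add(node.name)

    def check_in(self, node, now):
        """Record a live node checking in."""
        self.nodes[node.name] = node
        self.last_heard[node.name] = now

    def purge_dead(self, now, timeout):
        """Drop nodes silent for longer than timeout; return their names."""
        dead = sorted(name for name, seen in self.last_heard.items()
                      if now - seen > timeout)
        for name in dead:
            del self.nodes[name]
            del self.last_heard[name]
            for holders in self.allocations.values():
                holders.discard(name)
        return dead


# RM restarts and rebuilds its pool from the echo without waiting for check-ins.
pool = ResourcePool()
echo = {"job-1": [Node("node-a", 48), Node("node-b", 48)]}
pool.recover(echo, now=0.0)

# Only node-a checks in afterwards; node-b stays silent and is purged.
pool.check_in(Node("node-a", 48), now=50.0)
dead = pool.purge_dead(now=100.0, timeout=60.0)
```

In this sketch scheduling could begin immediately after recover(), and a node 
that was restored from the echo but never checks in is removed by the same 
timeout path as any other silent node.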

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
