[ https://issues.apache.org/jira/browse/UIMA-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613070#comment-13613070 ]
Lou DeGenaro commented on UIMA-2772:
------------------------------------
Update transport so that DuccProcess and DuccReservation carry a Node field,
and provide a getter/setter and constructors that accept it.
Update the orchestrator to use the newly added constructors.
Code delivered.
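
For illustration only, here is a minimal sketch of what carrying a Node on a
transport object might look like. NodeSketch, DuccProcessSketch, and all of
their fields are hypothetical stand-ins, not the delivered DUCC code; a
reservation object would presumably follow the same pattern.

```java
import java.io.Serializable;

// Hypothetical stand-in for the Node type; the real DUCC type is not shown here.
final class NodeSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String name;

    NodeSketch(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hypothetical, simplified process record that travels between components.
class DuccProcessSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String processId;
    private NodeSketch node; // the newly carried field; null when unknown

    // Older-style constructor: no node information available.
    DuccProcessSketch(String processId) {
        this(processId, null);
    }

    // New constructor: the caller (for example the orchestrator) supplies the node.
    DuccProcessSketch(String processId, NodeSketch node) {
        this.processId = processId;
        this.node = node;
    }

    String getProcessId() {
        return processId;
    }

    NodeSketch getNode() {
        return node;
    }

    void setNode(NodeSketch node) {
        this.node = node;
    }
}
```

Keeping the older constructor and delegating to the new one is one way to let
existing callers compile unchanged while the orchestrator moves to the
node-aware form.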
> DUCC resource manager - Restart and fast-start
> ----------------------------------------------
>
> Key: UIMA-2772
> URL: https://issues.apache.org/jira/browse/UIMA-2772
> Project: UIMA
> Issue Type: Bug
> Components: DUCC
> Reporter: Jim Challenger
> Assignee: Jim Challenger
>
> Currently RM waits a "reasonable time" (init-stability) on startup to allow
> nodes to check in, before accepting scheduling requests. It is not possible
> to know exactly how long to wait, making init-stability a heuristic. For
> normal startup this is not a problem. If RM is restarting 'hot', or if the
> orchestrator publishes non-preemptable jobs on restart, and the necessary
> nodes have not arrived by the completion of init-stability wait, this can
> cause many problems: over-commitment, under-commitment, and in some cases
> inconsistent state (and crashes).
> To remedy this, RM will include the full Node object in its publications to
> the OR, which will echo them back for work that it believes to be active. On
> startup RM can fully reconstruct state as of its last publication from this,
> eliminating the problem. A side-effect of this is that RM need not wait for
> nodes to check in, significantly decreasing its startup time. If nodes added
> to the resource pool in this way never check in, the normal "dead node"
> mechanism will kick in, maintaining consistency.
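
The recovery idea described above can be sketched roughly as follows. This is
a simplified illustration under stated assumptions, not the RM implementation:
NodeInfo, EchoedWork, PoolSketch, the "slots" accounting, and the
timeout-based purge are all hypothetical names and simplifications.

```java
import java.util.*;

// Hypothetical node description echoed back by the orchestrator.
final class NodeInfo {
    final String name;
    final int totalSlots;

    NodeInfo(String name, int totalSlots) {
        this.name = name;
        this.totalSlots = totalSlots;
    }
}

// Hypothetical unit of active work, carrying the full node it runs on.
final class EchoedWork {
    final String workId;
    final NodeInfo node;
    final int slotsUsed;

    EchoedWork(String workId, NodeInfo node, int slotsUsed) {
        this.workId = workId;
        this.node = node;
        this.slotsUsed = slotsUsed;
    }
}

// Hypothetical resource pool rebuilt from echoed work on startup.
final class PoolSketch {
    private final Map<String, NodeInfo> nodes = new HashMap<>();
    private final Map<String, Integer> usedSlots = new HashMap<>();
    private final Map<String, Long> lastSeenMs = new HashMap<>();

    // Rebuild nodes and their occupancy without waiting for check-ins.
    void reconstruct(List<EchoedWork> echoed, long nowMs) {
        for (EchoedWork w : echoed) {
            nodes.putIfAbsent(w.node.name, w.node);
            usedSlots.merge(w.node.name, w.slotsUsed, Integer::sum);
            // Provisional timestamp: the node has not actually checked in yet.
            lastSeenMs.putIfAbsent(w.node.name, nowMs);
        }
    }

    // A real check-in from a node refreshes its timestamp.
    void heartbeat(NodeInfo node, long nowMs) {
        nodes.putIfAbsent(node.name, node);
        lastSeenMs.put(node.name, nowMs);
    }

    // "Dead node" handling: drop nodes that have been silent longer than timeoutMs.
    List<String> purgeDead(long nowMs, long timeoutMs) {
        List<String> dead = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeenMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > timeoutMs) {
                dead.add(e.getKey());
                it.remove();
            }
        }
        for (String name : dead) {
            nodes.remove(name);
            usedSlots.remove(name);
        }
        Collections.sort(dead);
        return dead;
    }

    // Free capacity on a known node; 0 for nodes not in the pool.
    int freeSlots(String name) {
        NodeInfo n = nodes.get(name);
        return n == null ? 0 : n.totalSlots - usedSlots.getOrDefault(name, 0);
    }
}
```

In this sketch a node recovered from an echo is treated exactly like a node
that checked in at recovery time, so the ordinary timeout removes it if it
never reports, which mirrors the consistency argument in the description.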