[ https://issues.apache.org/jira/browse/UIMA-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251855#comment-15251855 ]
Lou DeGenaro commented on UIMA-4903:
------------------------------------
There are 2 critical determinations that affect the course of a Job in the
presence of JP failures:
1. Was the JP initializing or not?
2. Was the cause Framework or User?
These questions are answered in part by interpreting the Agent's
ReasonForStoppingProcess.
With respect to #2, the cause is presumed to be System (Framework) unless the
reason is one of { Croaked, ExceededShareSize, ExceededSwapThreshold,
ExceededErrorThreshold }, which are treated as User failures.
See
org.apache.uima.ducc.transport.event.common.DuccProcessConcurrentMap.isUserFailureReasonForStoppingProcess(String reason).
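The classification described above can be sketched roughly as follows. This
is an illustration only, not the DUCC source: the class name
FailureReasonSketch and the method isUserFailure are hypothetical, and the
only facts taken from this comment are the four reason names and the rule
that any other reason is presumed to be System.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from DUCC.
final class FailureReasonSketch {

    // The four reasons this comment lists as User failures.
    private static final Set<String> USER_REASONS = new HashSet<>(Arrays.asList(
            "Croaked",
            "ExceededShareSize",
            "ExceededSwapThreshold",
            "ExceededErrorThreshold"));

    // Returns true only for the listed reasons; anything else,
    // including null, is presumed to be a System failure.
    static boolean isUserFailure(String reason) {
        return USER_REASONS.contains(reason);
    }

    public static void main(String[] args) {
        System.out.println(isUserFailure("Croaked"));
        // "SomeOtherReason" is a placeholder, not a real DUCC reason.
        System.out.println(isUserFailure("SomeOtherReason"));
    }
}
```

The real method may differ in detail (for example in how it handles null or
unrecognized strings); consult the source referenced above.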
The error limit for killing a Job counts only JPs that failed due to User error.
See
org.apache.uima.ducc.transport.event.common.DuccProcessConcurrentMap.isFailedProcess(IDuccProcess process)
and its callers.
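To illustrate why System-caused JP failures do not push a Job toward its
error limit, here is a minimal sketch. It is not the DUCC implementation:
the class JobErrorLimitSketch, its methods, the use of plain strings for
stop reasons, and the "greater than or equal to the limit" comparison are
all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and the threshold comparison are assumptions.
final class JobErrorLimitSketch {

    // Reasons this comment lists as User failures.
    private static final Set<String> USER_REASONS = new HashSet<>(Arrays.asList(
            "Croaked",
            "ExceededShareSize",
            "ExceededSwapThreshold",
            "ExceededErrorThreshold"));

    // Counts only the stopped JPs whose reason is a User failure.
    static long countUserFailures(List<String> stopReasons) {
        return stopReasons.stream().filter(USER_REASONS::contains).count();
    }

    // Assumed policy: the limit is reached once User failures meet it.
    static boolean reachesErrorLimit(List<String> stopReasons, int limit) {
        return countUserFailures(stopReasons) >= limit;
    }

    public static void main(String[] args) {
        // "SomeSystemReason" is a placeholder for any non-User reason.
        List<String> reasons = Arrays.asList(
                "Croaked", "SomeSystemReason", "ExceededShareSize");
        System.out.println(countUserFailures(reasons));
        System.out.println(reachesErrorLimit(reasons, 3));
    }
}
```

In this sketch three JPs have stopped, but only two count toward the limit,
so a limit of 3 is not reached.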
Therefore, OR is working as designed.
Comments have been added above
DuccProcessConcurrentMap.isFailedProcess(IDuccProcess process).
Also, see Jira 4905.
> DUCC Orchestrator (OR) Health Monitor fails to detect too many Job Process
> failures
> -----------------------------------------------------------------------------------
>
> Key: UIMA-4903
> URL: https://issues.apache.org/jira/browse/UIMA-4903
> Project: UIMA
> Issue Type: Improvement
> Components: DUCC
> Reporter: Lou DeGenaro
> Assignee: Lou DeGenaro
> Fix For: 2.1.0-Ducc
>
>
> To ensure a failing Job does not mistakenly live forever, the OR health
> monitor should not use initialization completed as a criterion for enforcing
> the too many JP failures limit.