[jira] [Commented] (UIMA-4903) DUCC Orchestrator (OR) Health Monitor fails to detect too many Job Process failures

Lou DeGenaro (JIRA) Thu, 21 Apr 2016 05:54:52 -0700

    [ https://issues.apache.org/jira/browse/UIMA-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251855#comment-15251855 ]


Lou DeGenaro commented on UIMA-4903:
------------------------------------

There are 2 critical determinations that affect the course of a Job in the 
presence of JP failures:

1. Was the JP initializing or not?
2. Was the cause Framework or User?

These questions are answered in part by interpreting the Agent's 
ReasonForStoppingProcess.

With respect to #2, the cause is presumed to be Framework (System) unless the 
reason is one of { Croaked, ExceededShareSize, ExceededSwapThreshold, 
ExceededErrorThreshold }.  See 
org.apache.uima.ducc.transport.event.common.DuccProcessConcurrentMap.isUserFailureReasonForStoppingProcess(String reason).

The error limit for killing a Job counts only JPs that failed due to User 
error.  See 
org.apache.uima.ducc.transport.event.common.DuccProcessConcurrentMap.isFailedProcess(IDuccProcess process) 
and its callers.
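
The rule described above can be sketched roughly as follows. This is only an 
illustration, not the DUCC source: the class JpFailureSketch, its method 
names, the use of plain strings for reasons, and the ">=" comparison against 
the limit are all assumptions made for the example. Only the four User 
reasons are taken from the comment itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the actual DuccProcessConcurrentMap code.
class JpFailureSketch {

    // Reasons attributed to the User, as listed in the comment.
    // Any other reason is presumed to be a Framework (System) cause.
    static final Set<String> USER_REASONS = new HashSet<>(Arrays.asList(
            "Croaked",
            "ExceededShareSize",
            "ExceededSwapThreshold",
            "ExceededErrorThreshold"));

    // True only when the stop reason is one of the User reasons.
    static boolean isUserFailureReason(String reason) {
        return reason != null && USER_REASONS.contains(reason);
    }

    // Counts the JP stop reasons that are attributed to the User.
    static long countUserFailures(List<String> reasons) {
        return reasons.stream()
                .filter(JpFailureSketch::isUserFailureReason)
                .count();
    }

    // Assumption: the Job is killed once User failures reach the limit.
    static boolean exceedsFailureLimit(List<String> reasons, long limit) {
        return countUserFailures(reasons) >= limit;
    }
}
```

Under this sketch, a Job whose JPs all stop for non-User reasons never 
reaches the limit, which matches the behavior described as working as 
designed.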

Therefore, OR is working as designed.

Comments have been added above 
DuccProcessConcurrentMap.isFailedProcess(IDuccProcess process).

Also, see UIMA-4905.

> DUCC Orchestrator (OR) Health Monitor fails to detect too many Job Process 
> failures
> -----------------------------------------------------------------------------------
>
>                 Key: UIMA-4903
>                 URL: https://issues.apache.org/jira/browse/UIMA-4903
>             Project: UIMA
>          Issue Type: Improvement
>          Components: DUCC
>            Reporter: Lou DeGenaro
>            Assignee: Lou DeGenaro
>             Fix For: 2.1.0-Ducc
>
>
> To ensure a failing Job does not mistakenly live forever, the OR health 
> monitor should not use initialization completed as a criterion for enforcing 
> the too many JP failures limit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
