[ 
https://issues.apache.org/jira/browse/OOZIE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137779#comment-14137779
 ] 

Rohini Palaniswamy commented on OOZIE-1813:
-------------------------------------------

From [[email protected]]:
  Currently, the difference between the current time and the coord job start
time is used to check whether an abandoned job is "older_than" the threshold.

If a coord job is in catch-up mode and its start time is earlier than the
current time, then it will be considered an "older" job to kill even though it
was created just now.

However, a coord job can also be created now with a start time in the future,
so using the coord job created time as the base may not be accurate either.

Thanks for catching this, Michelle. So the 2-day buffer should be measured from
max(created time, start time), and OOZIE-1813-Amendment-V1.patch addresses that.
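
To make the max(created time, start time) rule concrete, here is a minimal
sketch of the age check. It is not the code from
OOZIE-1813-Amendment-V1.patch; the class name AbandonedCoordAge, its methods,
and the sample dates are hypothetical and only illustrate the idea.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not taken from the Oozie patch.
final class AbandonedCoordAge {
    private AbandonedCoordAge() {}

    /** Returns the later of the job's created time and its start time. */
    static Instant baseline(Instant createdTime, Instant startTime) {
        return createdTime.isAfter(startTime) ? createdTime : startTime;
    }

    /** True only when the baseline lies more than olderThan before now. */
    static boolean isOlderThan(Instant createdTime, Instant startTime,
                               Instant now, Duration olderThan) {
        return baseline(createdTime, startTime).plus(olderThan).isBefore(now);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-09-17T00:00:00Z");
        Duration buffer = Duration.ofDays(2); // the 2-day buffer from the comment

        // Catch-up mode: start time long past, but the job was created an hour ago.
        boolean catchUp = isOlderThan(now.minus(Duration.ofHours(1)),
                now.minus(Duration.ofDays(30)), now, buffer);

        // Created a while ago, but the start time is still in the future.
        boolean futureStart = isOlderThan(now.minus(Duration.ofDays(10)),
                now.plus(Duration.ofDays(5)), now, buffer);

        System.out.println("catch-up job older than buffer: " + catchUp);
        System.out.println("future-start job older than buffer: " + futureStart);
    }
}
```

With this baseline, neither the freshly created catch-up job nor the job with
a future start time is treated as older than the buffer.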

+1 pending Jenkins.

> Add service to report/kill rogue bundles and coordinator jobs
> -------------------------------------------------------------
>
>                 Key: OOZIE-1813
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1813
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah
>            Assignee: Purshotam Shah
>             Fix For: trunk
>
>         Attachments: OOZIE-1813-Amendment-V1.patch, OOZIE-1813-V2.patch, 
> OOZIE-1813-V3.patch, OOZIE-1813-V4.patch, OOZIE-1813-V5.patch, 
> OOZIE-1813-V6.patch, OOZIE-1813-V7.patch, OOZIE-1813-V8.patch
>
>
> People leave their test coordinator and bundle jobs without ever killing
> them, and the jobs eat up resources heavily. We should have a service that
> periodically checks for abandoned coords and reports/kills them.
> We can add several kinds of logic for this (e.g. number of consecutive
> failed/timedout actions, total number of failed/timedout actions).
> To start with, if the number of coord actions with failed/timedout status
> exceeds a defined value, the coord is considered rogue.
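
The starting rule in the description above (count of failed/timedout actions
greater than a defined value) could be sketched roughly as follows. The class
RogueCoordCheck, the ActionStatus enum, and the threshold parameter are
assumptions made for illustration; the enum is a simplified stand-in and not
Oozie's own status type.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the threshold rule; not the service from the patch.
final class RogueCoordCheck {
    // Simplified stand-in for coordinator action statuses (assumption).
    enum ActionStatus { SUCCEEDED, RUNNING, FAILED, TIMEDOUT }

    private RogueCoordCheck() {}

    /** Counts actions whose status is FAILED or TIMEDOUT. */
    static long countFailedOrTimedOut(List<ActionStatus> statuses) {
        long count = 0;
        for (ActionStatus s : statuses) {
            if (s == ActionStatus.FAILED || s == ActionStatus.TIMEDOUT) {
                count++;
            }
        }
        return count;
    }

    /** A coord is rogue when the count is strictly greater than the threshold. */
    static boolean isRogue(List<ActionStatus> statuses, long threshold) {
        return countFailedOrTimedOut(statuses) > threshold;
    }

    public static void main(String[] args) {
        List<ActionStatus> statuses = Arrays.asList(
                ActionStatus.FAILED, ActionStatus.TIMEDOUT,
                ActionStatus.SUCCEEDED, ActionStatus.FAILED);
        // Three failed/timedout actions against a threshold of two.
        System.out.println("rogue: " + isRogue(statuses, 2));
    }
}
```

The comparison is strict, matching the "> defined value" wording; a
consecutive-failure rule would need the actions in order and is not shown here.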



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
