[ https://issues.apache.org/jira/browse/OOZIE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122509#comment-14122509 ]
Purshotam Shah commented on OOZIE-1813:
---------------------------------------
New patch, which includes:
1.
{quote}
Actually, on the final patch, can you add the new config properties to
oozie-default.xml?
{quote}
2.
{quote}
Also, can you add a check to only kill the coord job if it is older than 2 days?
If something was submitted and there were a lot of failures initially, this
would kill the coord job. We should give the user some time to correct any
errors and rerun if needed.
{quote}
3.
{quote}
In HA, this service should only run on the primary server.
{quote}
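
For illustration only, here is a minimal sketch of how the age check from point 2 could combine with the failed/timedout threshold from the issue description. It is not taken from the patch: the class name, method name, and the limit of 25 are hypothetical, and only the 2-day minimum age comes from the review comment. The HA restriction from point 3 is not modelled here.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not the code in the attached patches.
class AbandonedCoordCheck {
    // Assumed threshold for failed/timedout actions; the real value would be
    // a config property in oozie-default.xml.
    static final int FAILURE_LIMIT = 25;
    // Minimum job age before a kill is considered (2 days, per the review comment).
    static final Duration MIN_JOB_AGE = Duration.ofDays(2);

    // Returns true only if the coord has more failed/timedout actions than the
    // limit AND was created more than MIN_JOB_AGE before 'now'.
    static boolean isRogue(int failedOrTimedOutActions, Instant createdTime, Instant now) {
        boolean tooManyFailures = failedOrTimedOutActions > FAILURE_LIMIT;
        boolean oldEnough = Duration.between(createdTime, now).compareTo(MIN_JOB_AGE) > 0;
        return tooManyFailures && oldEnough;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-09-05T00:00:00Z");
        // 30 failures, 3 days old: over the limit and old enough
        System.out.println(isRogue(30, now.minus(Duration.ofDays(3)), now)); // prints "true"
        // 30 failures, 1 day old: too young, the user still has time to fix and rerun
        System.out.println(isRogue(30, now.minus(Duration.ofDays(1)), now)); // prints "false"
        // 5 failures, 10 days old: under the failure limit
        System.out.println(isRogue(5, now.minus(Duration.ofDays(10)), now)); // prints "false"
    }
}
```

Requiring both conditions means a freshly submitted coord with many early failures is left alone until it passes the minimum age.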
> Add service to report/kill rogue bundles and coordinator jobs
> -------------------------------------------------------------
>
> Key: OOZIE-1813
> URL: https://issues.apache.org/jira/browse/OOZIE-1813
> Project: Oozie
> Issue Type: Bug
> Reporter: Purshotam Shah
> Assignee: Purshotam Shah
> Attachments: OOZIE-1813-V2.patch, OOZIE-1813-V3.patch,
> OOZIE-1813-V4.patch, OOZIE-1813-V5.patch, OOZIE-1813-V6.patch,
> OOZIE-1813-V7.patch, OOZIE-1813-V8.patch
>
>
> People leave their test coordinator and bundle jobs running without ever
> killing them, and the jobs consume resources heavily. We should have a service
> that periodically checks for abandoned coords and reports/kills them.
> We can add multiple criteria for this (e.g. number of consecutive
> failed/timedout actions, total number of failed/timedout actions).
> To start with, if the number of coord actions with failed/timedout status
> exceeds a defined value, the coord is considered rogue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)