[
https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101842#comment-13101842
]
Hadoop QA commented on OOZIE-103:
---------------------------------
anew remarked:
My understanding of this issue is that we want to avoid failing workflows when
Hadoop is down. Instead, we want to defer the submission of the workflow until
Hadoop is back. What this requires is:
* Every time a workflow (action) is submitted, we need to know whether Hadoop
is up. If the daemon thread pings Hadoop - say - every minute, then there is a
window of up to 59 seconds during which jobs will still fail. How do we deal
with that?
* When Oozie comes up, the daemon will need up to - say - a minute to detect
that Hadoop is down. Same thing - how do we prevent job submission in that
window? If we persist the blacklist in the DB, then Oozie will remember that a
cluster was down before. If the cluster has come back in the meantime, it will
take up to - say - a minute until jobs are submitted again, but I think that
is acceptable.
* I am not sure whether a configuration at start-up is a good idea. It would
require an admin to create that config before restarting Oozie, so it would
not work, for instance, with automatic failover or restart of Oozie by a
monitoring system. Therefore I would prefer to persist the last known state.
> GH-68: Better reporting/handling of problems in Hadoop
> ------------------------------------------------------
>
> Key: OOZIE-103
> URL: https://issues.apache.org/jira/browse/OOZIE-103
> Project: Oozie
> Issue Type: Bug
> Reporter: Hadoop QA
>
> Add instrumentation to track performance stats of NN and JT (how long to get
> directory listing on hdfs; how long to submit a job or query JT queue)
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira