[ 
https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101842#comment-13101842
 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

anew remarked:
My understanding of this issue is that we want to avoid failing workflows when 
Hadoop is down. Instead, we want to defer the submission of the workflow until 
Hadoop is back. What this requires is:

* Every time a workflow (action) is submitted, we need to know whether Hadoop 
is up. If the daemon thread pings Hadoop - say - every minute, then there is a 
window of up to a minute after Hadoop goes down in which submitted jobs will 
still fail. How do we deal with that?
* When Oozie comes up, the daemon will need up to - say - a minute to detect 
that Hadoop is down. Same thing - how do we prevent job submission in that 
window? If we persist the blacklist in the DB, then Oozie will remember that a 
cluster was down before the restart. If the cluster has come back in the 
meantime, it will take up to - say - 1 minute until jobs are submitted again, 
but I think that is acceptable.
* I am not sure whether a configuration at start-up is a good idea. That would 
require an admin to create that config before restarting Oozie, and hence it 
would not work, for instance, for automatic failover or restart of Oozie by a 
monitoring system. Therefore I would prefer to persist the last known state.
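
To make the points above concrete, here is a rough sketch (in Python, not 
Oozie code) of how persisting the last known state and deferring on failure 
could fit together. All names in it (ClusterStateStore, DeferringSubmitter, 
the cluster_state table, the ping and submit callables) are made up for 
illustration. It assumes that a failed submission raises ConnectionError, and 
it keeps deferred jobs in memory only to stay short. The submit-time catch is 
one possible answer to the window between pings: a failure inside that window 
marks the cluster as down and defers the job instead of failing it.

```python
import sqlite3
import time
from collections import deque


class ClusterStateStore:
    """Persists the last known up/down state per cluster (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cluster_state ("
            "cluster TEXT PRIMARY KEY, is_up INTEGER NOT NULL, "
            "checked_at REAL NOT NULL)"
        )
        self.db.commit()

    def save(self, cluster, is_up):
        self.db.execute(
            "INSERT OR REPLACE INTO cluster_state VALUES (?, ?, ?)",
            (cluster, int(is_up), time.time()),
        )
        self.db.commit()

    def is_up(self, cluster):
        row = self.db.execute(
            "SELECT is_up FROM cluster_state WHERE cluster = ?", (cluster,)
        ).fetchone()
        # Assumption: a cluster we have never seen is treated as up.
        return True if row is None else bool(row[0])


class DeferringSubmitter:
    def __init__(self, store, ping, submit):
        self.store = store
        self.ping = ping      # callable(cluster) -> bool, hypothetical probe
        self.submit = submit  # callable(cluster, job); raises ConnectionError
        self.deferred = deque()  # in memory only, to keep the sketch short

    def submit_job(self, cluster, job):
        # Last known state says "down": defer without touching Hadoop.
        if not self.store.is_up(cluster):
            self.deferred.append((cluster, job))
            return "DEFERRED"
        try:
            self.submit(cluster, job)
            return "SUBMITTED"
        except ConnectionError:
            # Hadoop went down between two pings: record it and defer the
            # job instead of failing the workflow.
            self.store.save(cluster, False)
            self.deferred.append((cluster, job))
            return "DEFERRED"

    def poll(self, cluster):
        """One iteration of the daemon thread; returns jobs released."""
        up = self.ping(cluster)
        self.store.save(cluster, up)
        if not up:
            return 0
        released = 0
        pending, self.deferred = self.deferred, deque()
        for pending_cluster, job in pending:
            if pending_cluster != cluster:
                self.deferred.append((pending_cluster, job))
            elif self.submit_job(pending_cluster, job) == "SUBMITTED":
                released += 1
        return released
```

With a file path instead of ":memory:", the state survives a restart, so a 
freshly started instance would keep deferring until its first successful 
ping, which matches the "up to 1 minute" delay described above.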

> GH-68: Better reporting/handling of problems in Hadoop
> ------------------------------------------------------
>
>                 Key: OOZIE-103
>                 URL: https://issues.apache.org/jira/browse/OOZIE-103
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Hadoop QA
>
> Add instrumentation to track performance stats of NN and JT (how long to get 
> directory listing on hdfs; how long to submit a job or query JT queue)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
