[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Hadoop QA (JIRA) Fri, 09 Sep 2011 19:31:58 -0700

    [ 
https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101839#comment-13101839
 ]


Hadoop QA commented on OOZIE-103:
---------------------------------

tucu00 remarked:
Option #2 seems a better approach.

Mohammad, we've discussed this issue in the past and the idea was:

* find API calls to JT/NN that require fixed processing and are lightweight: 
we identified one JT API call and one NN API call with fixed processing on the 
JT and NN, namely fetching JT queue info and listing the NN root directory 
contents.

* find the response time of those API calls under normal load and under 
overload. This has to be done for the JT and NN, and it may differ on each 
JT/NN installation depending on the machine size and cluster size.

* determine the response time threshold, for the JT and for the NN, at which 
Oozie should back off.

* In HadoopAccessorService, before trying to get a FileSystem or a JobClient 
handle, check the response time of the above API calls first. If the values are 
below the threshold, retrieve the FS or JC handle; otherwise back off by 
throwing an exception for a transient error.

* to optimize the above logic, the HadoopAccessorService should do the response 
check only if the last check was done more than X secs (default 60) ago. And 
if at some point the JT/NN is overloaded, HadoopAccessorService should back off 
for the next Y secs (default 60) without even trying to hit the JT/NN.
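
The steps above could be sketched roughly as follows. This is only an 
illustration, not code from Oozie: ResponseTimeGate, Probe, 
TransientBackoffException and ensureResponsive are hypothetical names, the 
probe stands in for the lightweight JT or NN call, the clock is injected so the 
timing logic can be exercised without a cluster, and treating a failed probe 
the same as a slow one is an assumption that the discussion does not cover.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of the response-time check; not Oozie's actual code. */
final class ResponseTimeGate {

    /** Stand-in for the lightweight call (JT queue info or NN root listing). */
    interface Probe {
        void call() throws Exception;
    }

    /** Hypothetical exception signalling a transient error to the caller. */
    static final class TransientBackoffException extends RuntimeException {
        TransientBackoffException(String message) {
            super(message);
        }
    }

    private final Probe probe;
    private final long thresholdMs;      // response time threshold
    private final long checkIntervalMs;  // "X secs", 60 000 ms by default in the proposal
    private final long backoffMs;        // "Y secs", 60 000 ms by default in the proposal
    private final LongSupplier clockMs;  // e.g. System::currentTimeMillis

    private boolean checked = false;     // true once a probe has succeeded
    private long lastCheckMs;
    private boolean backingOff = false;
    private long backoffUntilMs;

    ResponseTimeGate(Probe probe, long thresholdMs, long checkIntervalMs,
                     long backoffMs, LongSupplier clockMs) {
        this.probe = probe;
        this.thresholdMs = thresholdMs;
        this.checkIntervalMs = checkIntervalMs;
        this.backoffMs = backoffMs;
        this.clockMs = clockMs;
    }

    /** Call before handing out a FileSystem or JobClient handle. */
    synchronized void ensureResponsive() {
        long now = clockMs.getAsLong();

        // While backing off, fail fast without touching the JT/NN at all.
        if (backingOff) {
            if (now < backoffUntilMs) {
                throw new TransientBackoffException("still backing off");
            }
            backingOff = false;
            checked = false; // force a fresh probe once the back-off expires
        }

        // Reuse the last good result if it is recent enough.
        if (checked && now - lastCheckMs < checkIntervalMs) {
            return;
        }

        // Time the lightweight call.
        long start = clockMs.getAsLong();
        boolean failed = false;
        try {
            probe.call();
        } catch (Exception e) {
            failed = true; // assumption: a failing probe counts as overload
        }
        long end = clockMs.getAsLong();
        long elapsedMs = end - start;

        lastCheckMs = end;
        checked = true;

        // Below the threshold is healthy; anything else triggers back-off.
        if (failed || elapsedMs >= thresholdMs) {
            backingOff = true;
            backoffUntilMs = end + backoffMs;
            throw new TransientBackoffException(
                    "probe took " + elapsedMs + " ms, threshold " + thresholdMs + " ms");
        }
    }
}
```

In this sketch there would be one gate per service (one for the JT, one for 
the NN), each with its own threshold, since the thresholds are expected to 
differ per installation.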

> GH-68: Better reporting/handling of problems in Hadoop
> ------------------------------------------------------
>
>                 Key: OOZIE-103
>                 URL: https://issues.apache.org/jira/browse/OOZIE-103
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Hadoop QA
>
> Add instrumentation to track performance stats of NN and JT (how long to get 
> directory listing on hdfs; how long to submit a job or query JT queue)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
