[
https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101839#comment-13101839
]
Hadoop QA commented on OOZIE-103:
---------------------------------
tucu00 remarked:
Option #2 seems a better approach.
Mohammad, we've discussed this issue in the past and the idea was:
* find API calls to JT/NN that are lightweight and require fixed processing: we
identified one JT API call and one NN API call with fixed processing on the JT
and NN: fetching JT queue info and listing the NN root directory contents.
* find the response time of those API calls under normal load and under
overload. This has to be done for both the JT and the NN, and it may differ on
each JT/NN installation depending on the machine size and cluster size.
* determine the response time threshold for JT and NN for Oozie to do back-off.
* In HadoopAccessorService, before trying to get a FileSystem or a JobClient
handle, check the response time of the above API calls first. If the values are
below the threshold, retrieve the FS or JC handle; otherwise back off by
throwing an exception for a transient error.
* to optimize the above logic, the HadoopAccessorService should do the response
check only if the last check was done more than X secs (default 60) ago. And
if at some point the JT/NN is overloaded, HadoopAccessorService should back off
for the next Y secs (default 60) without even trying to hit the JT/NN.
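
A rough sketch of the check-and-back-off logic described in the last two
points, in plain Java. Everything here is an assumption for illustration:
the class BackoffGate, the exception TransientOverloadException and the
injected probe/clock suppliers are hypothetical names, not existing Oozie or
Hadoop APIs. The probe stands in for the lightweight JT or NN call and
returns its response time in milliseconds.

```java
import java.util.function.LongSupplier;

/** Hypothetical gate for one service (JT or NN); not part of Oozie. */
class BackoffGate {
    /** Hypothetical marker for a transient "service overloaded" error. */
    static class TransientOverloadException extends RuntimeException {
        TransientOverloadException(String msg) { super(msg); }
    }

    private final LongSupplier probeMillis;  // runs the lightweight call, returns latency in ms
    private final LongSupplier clockMillis;  // current time in ms (injectable for testing)
    private final long thresholdMillis;      // response time threshold
    private final long checkIntervalMillis;  // X: minimum time between healthy checks
    private final long backoffMillis;        // Y: how long to back off after an overload

    private boolean healthyCheckCached = false;
    private long lastCheckAt = 0;
    private long backoffUntil = 0;

    BackoffGate(LongSupplier probeMillis, LongSupplier clockMillis,
                long thresholdMillis, long checkIntervalMillis, long backoffMillis) {
        this.probeMillis = probeMillis;
        this.clockMillis = clockMillis;
        this.thresholdMillis = thresholdMillis;
        this.checkIntervalMillis = checkIntervalMillis;
        this.backoffMillis = backoffMillis;
    }

    /** Call before handing out a FS or JC handle; throws if the service looks overloaded. */
    synchronized void checkAvailable() {
        long now = clockMillis.getAsLong();
        // Inside the back-off window: fail fast without touching the service.
        if (now < backoffUntil) {
            throw new TransientOverloadException("backing off until " + backoffUntil);
        }
        // Last healthy check is recent enough (not more than X ago): skip the probe.
        if (healthyCheckCached && now - lastCheckAt <= checkIntervalMillis) {
            return;
        }
        long latency = probeMillis.getAsLong();
        if (latency >= thresholdMillis) {
            // Overloaded: open the back-off window and force a fresh probe afterwards.
            healthyCheckCached = false;
            backoffUntil = now + backoffMillis;
            throw new TransientOverloadException("response time " + latency + " ms");
        }
        healthyCheckCached = true;
        lastCheckAt = now;
    }
}
```

In this sketch one gate instance would exist per service, and a caller such
as HadoopAccessorService would invoke checkAvailable() before creating the
handle. Forcing a new probe once the back-off window ends is a design choice
of the sketch, not something stated in the comment above.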
> GH-68: Better reporting/handling of problems in Hadoop
> ------------------------------------------------------
>
> Key: OOZIE-103
> URL: https://issues.apache.org/jira/browse/OOZIE-103
> Project: Oozie
> Issue Type: Bug
> Reporter: Hadoop QA
>
> Add instrumentation to track performance stats of NN and JT (how long to get
> directory listing on hdfs; how long to submit a job or query JT queue)