[
https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101840#comment-13101840
]
Hadoop QA commented on OOZIE-103:
---------------------------------
mislam77 remarked:
The Oozie team at Yahoo plans to implement the following idea:
* Oozie will only monitor whether Hadoop is up or down. For this task, it will
not consider whether Hadoop is slow.
* Oozie will provide a way to blacklist any Hadoop JT or NN. A Hadoop server
can be blacklisted in two ways:
  * By admin: an admin can blacklist any Hadoop server through the command-line
    interface, most often for an upgrade or maintenance.
  * By the Oozie daemon service: a dedicated thread will ping each Hadoop
    server that is not blacklisted by an admin and determine whether the server
    needs to be blacklisted.
* How are items removed from the blacklist? There are also two ways:
  * An admin can send a WS request to take the server off the blacklist. A
    server blacklisted by an admin should be taken off by an admin only.
  * The Oozie daemon thread will periodically ping each Hadoop server that is
    not blacklisted by an admin and, if it sees that the server is up, remove
    it from the list.
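The add/remove rules above can be sketched roughly as follows. This is only an illustration: `ServerBlacklist` and `Creator` are hypothetical names, not existing Oozie classes, and two behaviours are assumptions not stated in the proposal (an admin entry replaces a daemon entry for the same server, and an admin may remove a daemon-created entry).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical types for illustration only; none of these exist in Oozie.
enum Creator { ADMIN, OOZIE }

final class ServerBlacklist {
    // Maps a server URI (JT or NN) to whoever blacklisted it.
    private final Map<String, Creator> entries = new ConcurrentHashMap<>();

    /** Adds a server. Assumption: an ADMIN entry is never downgraded to OOZIE. */
    void add(String server, Creator by) {
        entries.merge(server, by,
                (existing, incoming) -> existing == Creator.ADMIN ? existing : incoming);
    }

    /**
     * Removes a server from the blacklist.
     * Returns false if the server is not listed, or if the daemon tries to
     * remove an entry that was created by an admin.
     */
    boolean remove(String server, Creator by) {
        Creator owner = entries.get(server);
        if (owner == null) {
            return false;
        }
        if (owner == Creator.ADMIN && by != Creator.ADMIN) {
            return false; // admin entries can only be taken off by an admin
        }
        // Conditional remove: succeeds only if the owner has not changed meanwhile.
        return entries.remove(server, owner);
    }

    boolean isBlacklisted(String server) {
        return entries.containsKey(server);
    }
}
```

Keeping the creator next to each entry is what lets the daemon thread clear its own entries without touching servers an admin took down for maintenance.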
Implementations:
===============
* A new WS endpoint for AdminServlet to allow admin to add/remove a blacklist
item.
* A new table (the name could be black_list_resources) will be needed, with the
following columns:
  * Resource Name (e.g. http://localhost:9000/jobtracker)
  * Resource Type (hadoop-JT)
  * Creator (Admin/oozie)
  * Created time (UTC)
  * Last modified time (UTC)
  * More...
* A new monitor service needs to be implemented. It will do the following
tasks:
  * At initialization, it will read the blacklist table and populate an
    in-memory object that represents the table's current contents.
  * It will provide APIs to add a blacklisted item to the table or remove one
    from it, updating the in-memory object at the same time.
  * It will provide an API to check whether an item is blacklisted.
  * It will periodically ping all the whitelisted servers to see whether each
    server is up. If a server is down, it will add that server to the
    blacklist. Question: what is an easy way to ping Hadoop? Action item: talk
    to the Hadoop team about a non-blocking API call, if one exists.
* HadoopAccessorService (HAS) will need to check whether the NN/JT is in the
blacklist. If it is, HAS returns an error/exception.
* Each caller of HAS.createJobClient or HAS.createFileSystem needs to handle
the above error case. For the time being, the caller will update the
job/action record's last modified time without performing any Hadoop action,
and rely on the RecoveryService to pick the record up later for retry.
* For now, any job submission will fail immediately if the corresponding NN
is in the blacklist.
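A rough sketch of the monitor service's polling step is given below. All names (`HadoopMonitor`, `Owner`, `pollOnce`) are hypothetical. Because the proposal leaves open how to ping Hadoop, the probe is passed in as a plain `Predicate<String>` rather than assuming any Hadoop API, and persistence to the table is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical monitor; not an existing Oozie service.
final class HadoopMonitor {
    enum Owner { ADMIN, OOZIE }

    private final Set<String> knownServers;
    private final Predicate<String> isUp; // assumed probe: true if the server answers
    private final Map<String, Owner> blacklist = new ConcurrentHashMap<>();

    HadoopMonitor(Set<String> knownServers, Predicate<String> isUp) {
        this.knownServers = knownServers;
        this.isUp = isUp;
    }

    void adminBlacklist(String server) {
        blacklist.put(server, Owner.ADMIN);
    }

    boolean isBlacklisted(String server) {
        return blacklist.containsKey(server);
    }

    /** One polling pass over every server that is not blacklisted by an admin. */
    void pollOnce() {
        for (String server : knownServers) {
            if (blacklist.get(server) == Owner.ADMIN) {
                continue; // admin entries are never touched by the daemon
            }
            boolean up;
            try {
                up = isUp.test(server);
            } catch (RuntimeException e) {
                up = false; // a failing probe is treated as "server down"
            }
            if (up) {
                blacklist.remove(server, Owner.OOZIE);      // clear daemon entry only
            } else {
                blacklist.putIfAbsent(server, Owner.OOZIE); // never overwrite ADMIN
            }
        }
    }

    /** Runs pollOnce on a dedicated thread; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(this::pollOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

Catching probe failures inside the loop matters here because a scheduled task that throws is not run again by the executor.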
Note: This implementation is not meant to solve the whole problem. Some future
work will be needed to improve the functionality.
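The caller-side handling described for HAS could look something like the sketch below. Every name in it (`ServerBlacklistedException`, `AccessGuard`, `ActionRecord`, `ActionCaller`) is a hypothetical stand-in rather than an Oozie class, and the real Hadoop call is reduced to setting a flag.

```java
import java.time.Instant;
import java.util.function.Predicate;

// All names here are hypothetical stand-ins, not Oozie classes.
class ServerBlacklistedException extends Exception {
    ServerBlacklistedException(String server) {
        super("Hadoop server is blacklisted: " + server);
    }
}

final class AccessGuard {
    private final Predicate<String> isBlacklisted;

    AccessGuard(Predicate<String> isBlacklisted) {
        this.isBlacklisted = isBlacklisted;
    }

    /** Throws if the given NN/JT is currently blacklisted. */
    void checkAvailable(String server) throws ServerBlacklistedException {
        if (isBlacklisted.test(server)) {
            throw new ServerBlacklistedException(server);
        }
    }
}

final class ActionRecord {
    Instant lastModified;
    boolean submitted;
}

final class ActionCaller {
    /** Returns true if the Hadoop call was attempted, false if it was deferred. */
    static boolean run(AccessGuard guard, String server, ActionRecord record, Instant now) {
        try {
            guard.checkAvailable(server);
        } catch (ServerBlacklistedException e) {
            // No Hadoop action: only touch the record so a later recovery pass retries it.
            record.lastModified = now;
            return false;
        }
        record.submitted = true; // placeholder for the real Hadoop call
        record.lastModified = now;
        return true;
    }
}
```

A checked exception is used here so that each caller is forced by the compiler to decide how to handle a blacklisted server; that is a design choice of this sketch, not something the proposal specifies.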
> GH-68: Better reporting/handling of problems in Hadoop
> ------------------------------------------------------
>
> Key: OOZIE-103
> URL: https://issues.apache.org/jira/browse/OOZIE-103
> Project: Oozie
> Issue Type: Bug
> Reporter: Hadoop QA
>
> Add instrumentation to track performance stats of NN and JT (how long to get
> directory listing on hdfs; how long to submit a job or query JT queue)