[ 
https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101840#comment-13101840
 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

mislam77 remarked:
The Oozie team at Yahoo plans to implement the following idea:

* Oozie will only monitor whether Hadoop is up or down. In this task, it will
not consider whether Hadoop is slow.
* Oozie will provide a way to blacklist any Hadoop JT or NN. There will be two
ways to blacklist a Hadoop server:
   * By admin: An admin can blacklist any Hadoop server through the
command-line interface, most often for an upgrade or maintenance.
   * By Oozie daemon service: A dedicated thread will ping each Hadoop server
(that is not blacklisted by an admin) and determine whether the server needs
to be blacklisted.

* How are items removed from the blacklist? There are also two ways:
   * An admin can send a WS request to take a server off the blacklist. A
server blacklisted by an admin should be taken off by an admin only.
   * The Oozie daemon thread will periodically ping the Hadoop servers (those
not blacklisted by an admin) and, if it sees that a server is up, remove it
from the list.
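
The add/remove rules above can be sketched in Java roughly as follows. This is
only an illustration: the names Blacklist and Creator are hypothetical and do
not appear in the proposal, and the choice that an admin entry replaces an
existing daemon entry for the same server is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names; the proposal does not define any classes.
enum Creator { ADMIN, OOZIE }

class Blacklist {
    // server name -> who blacklisted it
    private final Map<String, Creator> entries = new ConcurrentHashMap<>();

    /** Adds a server. An existing admin entry is never downgraded (assumption). */
    void add(String server, Creator by) {
        entries.merge(server, by, (old, now) -> old == Creator.ADMIN ? old : now);
    }

    /**
     * Removes a server. Returns false if the server is not listed, or if the
     * daemon tries to remove an entry that was created by an admin.
     */
    boolean remove(String server, Creator by) {
        Creator owner = entries.get(server);
        if (owner == null) {
            return false;
        }
        if (owner == Creator.ADMIN && by != Creator.ADMIN) {
            return false; // admin entries can be taken off by an admin only
        }
        return entries.remove(server, owner);
    }

    boolean contains(String server) {
        return entries.containsKey(server);
    }
}
```

With this shape, the daemon thread can call remove(server, Creator.OOZIE) on
every server it finds up without checking who created the entry first.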


Implementation:
===============
* A new WS endpoint in AdminServlet to allow an admin to add or remove a
blacklist item.
* A new table (the name could be black_list_resources) will be needed with the
following columns:
   * Resource name (e.g. http://localhost:9000/jobtracker)
   * Resource type (e.g. hadoop-JT)
   * Creator (Admin/oozie)
   * Created time (UTC)
   * Last modified time (UTC)
   * More...
* A new monitor service needs to be implemented. It will do the following
tasks:
   * At initialization, it will create an in-memory object representing the
contents of the current black_list_resources table. Basically, it will read
the table and populate the in-memory object.
   * It will provide APIs to add a blacklisted item to the table or remove one
from it, updating the in-memory object at the same time.
   * It will provide an API to check whether an item is blacklisted.
   * It will periodically ping all the whitelisted servers to see whether each
server is up. If a server is down, it will put that server on the blacklist.
Question: what is an easy way to ping Hadoop? Action item: talk to the Hadoop
team to get a non-blocking API call, if any.
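
A minimal Java sketch of such a monitor service, under stated assumptions, is
shown here. The class names BlacklistRow and MonitorService are hypothetical.
Because the way to ping Hadoop is still an open question, the check is passed
in as a plain Predicate rather than a real Hadoop call. Persisting changes to
the table and scheduling the periodic pass are left out.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical row type mirroring the proposed table columns.
final class BlacklistRow {
    final String resourceName;      // e.g. http://localhost:9000/jobtracker
    final String resourceType;      // e.g. hadoop-JT
    final String creator;           // "Admin" or "oozie"
    final Instant createdTime;      // UTC
    final Instant lastModifiedTime; // UTC

    BlacklistRow(String resourceName, String resourceType, String creator, Instant now) {
        this.resourceName = resourceName;
        this.resourceType = resourceType;
        this.creator = creator;
        this.createdTime = now;
        this.lastModifiedTime = now;
    }
}

class MonitorService {
    private final Map<String, BlacklistRow> cache = new ConcurrentHashMap<>();
    private final Predicate<String> isUp; // stand-in for the undecided Hadoop ping

    /** Populates the in-memory object from the rows read at initialization. */
    MonitorService(List<BlacklistRow> tableContents, Predicate<String> isUp) {
        for (BlacklistRow row : tableContents) {
            cache.put(row.resourceName, row);
        }
        this.isUp = isUp;
    }

    boolean isBlacklisted(String resourceName) {
        return cache.containsKey(resourceName);
    }

    /** One monitoring pass; knownServers maps resource name to resource type. */
    void pingOnce(Map<String, String> knownServers) {
        Instant now = Instant.now();
        for (Map.Entry<String, String> e : knownServers.entrySet()) {
            String name = e.getKey();
            BlacklistRow row = cache.get(name);
            if (row != null && "Admin".equals(row.creator)) {
                continue; // servers blacklisted by an admin are not pinged
            }
            boolean up = isUp.test(name);
            if (!up && row == null) {
                cache.put(name, new BlacklistRow(name, e.getValue(), "oozie", now));
            } else if (up && row != null) {
                cache.remove(name); // server is back, take it off the list
            }
        }
    }
}
```

Keeping the ping behind a Predicate means the service can be exercised
without a cluster and the real call can be plugged in once the Hadoop team
answers the action item.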
       

* HadoopAccessorService (HAS) will need to check whether the NN/JT is in the
blacklist. If it is, HAS returns an error/exception.

* Each caller of HAS.createJobClient or HAS.createFileSystem needs to handle
the above error case. For the time being, the caller will update the last
modified time of the job/action record without performing any Hadoop action,
and rely on the RecoveryService to pick the job/action up later for a retry.

* Currently, any job submission will fail immediately if the corresponding NN
is in the blacklist.
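
The HAS check and the caller-side handling described above could look roughly
like the following Java sketch. Every class and method here is a hypothetical
stand-in, not the real HadoopAccessorService API: createJobClient returns a
placeholder string instead of a Hadoop client, and ActionRecord holds only
the fields needed for the illustration.

```java
import java.util.Set;

// Hypothetical exception type; the proposal only says "error/exception".
class BlacklistedResourceException extends Exception {
    BlacklistedResourceException(String resource) {
        super("Resource is blacklisted: " + resource);
    }
}

// Stand-in for HadoopAccessorService (HAS).
class AccessorSketch {
    private final Set<String> blacklist;

    AccessorSketch(Set<String> blacklist) {
        this.blacklist = blacklist;
    }

    /** Refuses to create a client for a blacklisted JT. */
    String createJobClient(String jobTracker) throws BlacklistedResourceException {
        if (blacklist.contains(jobTracker)) {
            throw new BlacklistedResourceException(jobTracker);
        }
        return "client:" + jobTracker; // placeholder for a real job client
    }
}

// Stand-in for a job/action record.
class ActionRecord {
    long lastModifiedTime;
    boolean started;
}

class Caller {
    /** Returns true if the action ran, false if it was deferred for a retry. */
    static boolean startAction(AccessorSketch has, String jobTracker,
                               ActionRecord record, long now) {
        try {
            has.createJobClient(jobTracker);
            record.started = true;
            record.lastModifiedTime = now;
            return true;
        } catch (BlacklistedResourceException e) {
            // No Hadoop action: only touch the timestamp and leave the
            // record for the RecoveryService to pick up later.
            record.lastModifiedTime = now;
            return false;
        }
    }
}
```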

Note: This implementation is not meant to solve the whole problem. Some future
work will be needed to improve the functionality.

> GH-68: Better reporting/handling of problems in Hadoop
> ------------------------------------------------------
>
>                 Key: OOZIE-103
>                 URL: https://issues.apache.org/jira/browse/OOZIE-103
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Hadoop QA
>
> Add instrumentation to track performance stats of NN and JT (how long to get 
> directory listing on hdfs; how long to submit a job or query JT queue)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
