[ http://issues.apache.org/jira/browse/HADOOP-719?page=comments#action_12449858 ] Mahadev konar commented on HADOOP-719: --------------------------------------
The three components of HOD are:

bin/hod : This is the part of HOD that talks to the underlying scheduler/resource manager (Condor, Torque, etc.) and asks for machines. It is the main component of HOD: it allocates the specified number of nodes and then monitors the HOD instance for any updates/failures. A HOD instance is a map/reduce instance of Hadoop.

bin/hod-tt : This is the script that is run on a node to invoke a TaskTracker instance.

bin/hod-jt : This is the script that invokes the JobTracker instance on a given node.

The basic idea being:
--- The user invokes bin/hod with all the required arguments (including the job jar and the Hadoop jars)
--- bin/hod gets the required number of nodes from the underlying resource manager
--- invokes bin/hod-jt on the machine that will be running the JobTracker
--- invokes bin/hod-tt on the machines that will be running the TaskTrackers

Here is a design document for HOD.

-----
Design Document for bin/hod

The two key design points for bin/hod are:
1. an abstract factory that hides batch-scheduler-specific behavior behind a common API
2. an event/message loop that abstracts the post-startup interaction between Hadoop and the batch scheduler; it handles grow/shrink, termination, etc.

The command line parameters for bin/hod are:
-jar <map-reduce JAR>
-resource-manager torque, condor, ...
-min-node <n>      the minimum number of nodes required for this map/reduce instance
-max-node <m>      the maximum number of nodes required for this map/reduce instance
-qos <qos-level>
-account <project-account-name>
-wall-time <t>
-S or -to-scheduler k=v,k1=v1,...
-H or -to-hadoop k=v,...
-J or -to-jobclient k=v,...
and system-related ones like workdir, namenode address, resource manager package dir, hadoop dir or tarball, etc.

The classes that bin/hod will contain are the following:

ClusterConfig -- This is an abstract factory for all the cluster configs. It can be instantiated as:
TorqueClusterConfig()
CondorClusterConfig()
...
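To make the abstract-factory idea concrete, here is a minimal Python sketch. All class and method names (register, create_instance, submit_command) are illustrative stand-ins, not the actual HOD code; only qsub and condor_submit are the real scheduler commands.

```python
class ClusterConfig:
    """Hides batch-scheduler-specific commands behind a common API."""

    # Registry mapping a -resource-manager name to a concrete config class.
    _registry = {}

    @classmethod
    def register(cls, name, impl):
        cls._registry[name] = impl

    @classmethod
    def create_instance(cls, resource_manager):
        # Factory method: pick the scheduler-specific subclass by name.
        try:
            return cls._registry[resource_manager]()
        except KeyError:
            raise ValueError("unknown resource manager: %s" % resource_manager)

    def submit_command(self, job_script):
        raise NotImplementedError


class TorqueClusterConfig(ClusterConfig):
    def submit_command(self, job_script):
        # Torque submits jobs with qsub
        return ["qsub", job_script]


class CondorClusterConfig(ClusterConfig):
    def submit_command(self, job_script):
        # Condor submits jobs with condor_submit
        return ["condor_submit", job_script]


ClusterConfig.register("torque", TorqueClusterConfig)
ClusterConfig.register("condor", CondorClusterConfig)
```

So ClusterConfig.create_instance("torque").submit_command("job.pbs") yields the qsub command line, and the rest of bin/hod never mentions a scheduler by name.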
These classes hold the information about the cluster-specific commands/configurations: for example, the submit() method maps to condor_submit in the case of Condor and qsub in the case of Torque.

ClusterHead -- This abstract class maps to the JobTracker; it is a proxy for the JobTracker.

ClusterBody -- This abstract class maps to the TaskTrackers/worker nodes.

The above two classes can have batch-scheduler-specific ClusterHead/ClusterBody classes as their subclasses. There are two different classes for the TaskTrackers and the JobTracker since the reaction to events at the JobTracker differs from that at the TaskTrackers.

ClusterMonitor -- This class monitors the map/reduce instance, including the JobTracker, TaskTrackers, and job client; it polls them for events and handles all those events. It will also be extended to batch-scheduler-specific cluster monitors and will implement the common functionality (like querying the JobTracker/TaskTrackers).

JobConfig -- This class handles the basic arguments of map/reduce jobs: the input dir/output dir and the number of maps/reduces. Later we could use the input dir information to ask for nodes that are local/rack-local to the input files for the maps.

The pseudo code for bin/hod would look like:

main()
  Config cfg = new Config(args)  // this is the main command line parser
  JobConfig jc = new JobConfig(cfg)  // uses the following args: input dir, output dir, number of maps, number of reduces
  // The cluster config takes a JobConfig argument to create a scheduler config that can specify what kind of nodes it wants (nodes local to some rack?)
  ClusterConfig cc = ClusterConfig.createInstance(cfg, jc)  // uses any scheduler-specific arguments from cfg
  // The cluster head, i.e. the JobTracker, is always launched first
  ClusterHead ch = ClusterHead.createInstance(cfg, cc)  // gets the Hadoop-specific args from cfg
  // The ClusterBody takes the cluster head as an input parameter
  ClusterBody cb = ClusterBody.createInstance(cfg, ch, cc)
  JobClient jobclient = new JobClient(jobconf)
  jobclient.submitJob(ch)
  ClusterMonitor cm = new ClusterMonitor(ch, cb, jobclient)
  while (e = cm.newEvent()) {
    // the events might be of the type: jobtracker failed, tasktracker failed,
    // only a few reduces are left with most of the nodes lying idle, etc.
    cm.process(e)
  }

Comments?

> Integration of Hadoop with batch schedulers
> -------------------------------------------
>
>                 Key: HADOOP-719
>                 URL: http://issues.apache.org/jira/browse/HADOOP-719
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/streaming
>            Reporter: Mahadev konar
>         Assigned To: Mahadev konar
>
> Hadoop On Demand (HOD) is an integration of Hadoop with batch schedulers like Condor, Torque, Sun Grid, etc. Hadoop On Demand, or HOD hereafter, is a system that populates a Hadoop instance using a shared batch scheduler. HOD will find a requested number of nodes and start up Hadoop daemons on them. Users' map/reduce jobs can then run on the Hadoop instance. After the job is done, HOD gives the nodes back to the shared batch scheduler. A group of users will use HOD to acquire Hadoop instances of varying sizes, and the batch scheduler will schedule requests in a way that important jobs gain more importance/resources and finish fast. Here is a list of requirements for HOD and batch schedulers:
>
> Key Requirements:
> --- Should allocate the specified minimum number of nodes for a job.
> Many batch jobs can finish in time only when enough resources are allocated. Therefore the batch scheduler should allocate the asked number of nodes for a given job when the job starts.
> This is a simple form of what's known as gang scheduling.
> Often the minimum nodes are not available right away, especially if the job asked for a large number. The batch scheduler should support advance reservation for important jobs so that the wait time can be determined. In advance reservation, a reservation is created at the earliest future point when the preoccupied nodes become available. When nodes are currently idle but booked by future reservations, the batch scheduler is free to give them to other jobs to increase system utilization, but only when doing so does not delay existing reservations.
> --- Run short urgent jobs without causing too much loss to long jobs. In particular, should not kill the job tracker of a long job.
> Some jobs, mostly short ones, are time sensitive and need urgent treatment. Often, a large portion of the cluster nodes will be occupied by long-running jobs. The batch scheduler should be able to preempt long jobs and run urgent jobs. Then the urgent jobs will finish quickly and the long jobs can regain the nodes afterward.
> When preemption happens, HOD should minimize the loss to long jobs. In particular, it should not kill the job tracker of a long job.
> --- Be able to dial up, at run time, the share of resources for more important projects.
> Viewed at a high level, a given cluster is shared by multiple projects. A project consists of a number of jobs submitted by a group of users. The batch scheduler should allow important projects to have more resources. This should be tunable at run time, as which projects are deemed more important may change over time.
> --- Prevent malicious abuse of the system.
> A shared cluster environment can be put in jeopardy if malicious or erroneous job code does the following:
> -- hold unneeded resources for a long period
> -- use privileges for unworthy work
> Such abuse can easily cause under-utilization or starvation of other jobs.
> The batch scheduler should allow setting up policies for preventing resource abuse by:
> -- limiting privileges to legitimate uses asking for a proper amount
> -- throttling peak use of resources per player
> -- monitoring and reducing starvation
> --- The behavior should be simple and predictable.
> When the status of the system is queried, we should be able to determine what factors caused it to reach the current status and what the future behavior could be, with or without our tuning of the system.
> --- Be portable to major resource managers.
> The HOD design should be portable so that in the future we are able to plug in other resource managers.
> Some of the key requirements are implemented by the batch schedulers. The others need to be implemented by HOD.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
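As a follow-up to the pseudocode in the comment above, the while(e = cm.newEvent()) loop of ClusterMonitor might be sketched as below. This is a hypothetical Python sketch: the event queue stands in for actually polling the JobTracker/TaskTrackers, and all names are illustrative, not the real HOD implementation.

```python
from collections import deque


class ClusterMonitor:
    def __init__(self):
        # In a real monitor the events would come from polling the
        # JobTracker/TaskTrackers/job client; here a simple FIFO queue
        # stands in for that polling.
        self._events = deque()
        self.handled = []

    def post(self, event):
        self._events.append(event)

    def new_event(self):
        # Return the next pending event, or None when nothing is left.
        return self._events.popleft() if self._events else None

    def process(self, event):
        # A scheduler-specific subclass would react here, e.g. ask the
        # resource manager for a replacement node or shrink the cluster.
        self.handled.append("handled: %s" % event)


def run_monitor_loop(cm):
    # The while(e = cm.newEvent()) loop from the design sketch.
    while True:
        e = cm.new_event()
        if e is None:
            break
        cm.process(e)
```

For example, posting "tasktracker failed" and "only a few reduces left, most nodes idle" and then calling run_monitor_loop(cm) drains the queue and records a handled entry for each event, which is the grow/shrink/termination hook the design describes.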