[ 
https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642884#action_12642884
 ] 

Amareshwari Sriramadasu commented on HADOOP-4472:
-------------------------------------------------

The following approach separates the initialization of setup/cleanup tasks 
from initTasks.
1. Remove the initialization from initTasks and add it in the constructor of 
JobInProgress. The IDs of setup and cleanup tasks can be -1 and -2.
2. JobTracker does not inform the listeners when the job is submitted; 
instead, it waits for setup to complete. 
The JT could poll the waiting jobs to see whether setup is complete for them, 
but this would happen in the heartbeat, which becomes expensive. Alternatively, 
the JIP can tell the JT that setup is complete through an API, 
JobTracker.setupComplete, after which the JT informs the other listeners 
about the job. 
The jobs will then be added to the listeners in the order of setup completion.
3. Once initTasks is done, the job moves to the RUNNING state. Since 
initTasks can be called by anyone, the caller has to inform the JobTracker 
about the state change. This will be easier once HADOOP-4521 is in. For now we 
can add a package-private API in JobTracker.
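
The three steps above can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the actual patch: Tracker and Job 
are simplified stand-ins for JobTracker and JobInProgress, and apart from 
setupComplete every name here (JobListener, submit, setupTaskFinished, 
jobStateChanged, the PREP state) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only: Tracker and Job are NOT the real JobTracker
// and JobInProgress classes, and every method name except setupComplete
// is an assumption made for this sketch.
class SetupFlowSketch {

    /** Hypothetical listener, e.g. a scheduler that wants to hear about jobs. */
    interface JobListener {
        void jobAdded(Job job);
    }

    /** Stand-in for the JobTracker (JT). */
    static class Tracker {
        private final List<JobListener> listeners = new ArrayList<>();
        private final List<Job> waitingForSetup = new ArrayList<>();

        void addListener(JobListener listener) {
            listeners.add(listener);
        }

        // Step 2: on submission the JT records the job but tells no listener yet.
        Job submit(String name) {
            Job job = new Job(name, this);
            waitingForSetup.add(job);
            return job;
        }

        // Step 2: the job calls back, so the JT does not poll in the heartbeat.
        void setupComplete(Job job) {
            if (waitingForSetup.remove(job)) {
                for (JobListener listener : listeners) {
                    listener.jobAdded(job);
                }
            }
        }

        // Step 3: package-private hook for whoever ran initTasks (hypothetical name).
        void jobStateChanged(Job job) {
            // A real tracker would update its bookkeeping here.
        }
    }

    /** Stand-in for JobInProgress (JIP). */
    static class Job {
        enum State { PREP, RUNNING }

        final String name;
        final int setupTaskId;
        final int cleanupTaskId;
        State state = State.PREP;
        private final Tracker tracker;

        // Step 1: setup/cleanup tasks are created in the constructor, not in initTasks.
        Job(String name, Tracker tracker) {
            this.name = name;
            this.tracker = tracker;
            this.setupTaskId = -1;
            this.cleanupTaskId = -2;
        }

        // Called when the setup task finishes.
        void setupTaskFinished() {
            tracker.setupComplete(this);
        }

        // Step 3: may be called by any scheduler; it reports the state change to the JT.
        void initTasks() {
            state = State.RUNNING;
            tracker.jobStateChanged(this);
        }
    }

    /** Submits two jobs, finishes setup for the second one first, and returns the listener order. */
    static List<String> demo() {
        Tracker tracker = new Tracker();
        List<String> seen = new ArrayList<>();
        tracker.addListener(job -> seen.add(job.name));

        Job jobA = tracker.submit("jobA");
        Job jobB = tracker.submit("jobB");
        jobB.setupTaskFinished();
        jobA.setupTaskFinished();
        jobB.initTasks();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, jobB is submitted second but completes setup first, so the 
listener sees jobB before jobA, which is the ordering described in step 2.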

Thoughts?

Moreover, if initTasks is done asynchronously as a result of HADOOP-4513, we 
wouldn't need this change.

> Should we move out the creation of setup/cleanup tasks from 
> JobInProgress.initTasks()? 
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4472
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4472
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Vivek Ratan
>            Assignee: Amareshwari Sriramadasu
>
> JobInProgress.initTasks() creates TIPs for map and reduce tasks, and also the 
> newly-introduced setup and cleanup tasks. initTasks() is called by the 
> schedulers because, for memory optimization, schedulers may choose to 
> initialize M/R tasks at various moments (the Capacity Scheduler, for example, 
> calls initTasks() just when it considers a job for running). One can say that 
> Schedulers 'own' the initialization of M/R tasks in a job. Furthermore the JT 
> 'owns' the setup and cleanup tasks (it schedules them, and Schedulers are 
> unaware of these tasks). This causes a problematic dependency between the JT 
> and a Scheduler. For example, the Capacity Scheduler calls initTasks() and 
> immediately calls JobInProgress.obtainNewMapTask for a map task. This is a 
> problem today, because we cannot run any map or reduce tasks before the setup 
> task is run, which the Capacity Scheduler is not aware of. 
> Either all Schedulers are explicitly aware of setup/cleanup tasks and their 
> dependencies with M/R tasks (in which case, Schedulers 'own' the creation and 
> scheduling of all these tasks correctly), or the JT 'owns' the setup/cleanup 
> tasks and Schedulers are completely unaware of them (in which case, the 
> creation of setup/cleanup tasks must be moved out of initTasks into a 
> separate method which is called by the JT). 
> I think the latter is the right way to go (unless we implement HADOOP-4421, 
> in which case the former option may be viable as well). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
