[ https://issues.apache.org/jira/browse/MAPREDUCE-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154778#comment-13154778 ]

Tamas Sarlos commented on MAPREDUCE-728:
----------------------------------------

Hi Arun,

We do not control or simulate anything about the placement of a job's _output_ 
data. As for the interaction between task placement and a job's _input_ data, 
we rely on the input split locations recorded in the Hadoop job logs. The 
locality of a map task's input plays a role in determining that task's run 
time in the simulation. We differentiate among three levels of locality, based 
on the input split location closest to the task tracker: same node, same rack, 
and cross rack. We parse the Hadoop job logs using org.apache.hadoop.tools.rumen.
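
To make this more concrete, the classification boils down to something like the 
sketch below. The names are hypothetical and simplified, not the actual 
Mumak/rumen code, which works on rumen and topology objects rather than raw 
host and rack strings.

{code:java}
import java.util.List;

/** The three locality levels mentioned above. */
enum LocalityLevel { SAME_NODE, SAME_RACK, CROSS_RACK }

class LocalityClassifier {
  /**
   * Returns the best (closest) locality over all recorded locations of the
   * task's input split, relative to the task tracker running the attempt.
   */
  static LocalityLevel classify(String trackerHost, String trackerRack,
                                List<String> splitHosts,
                                List<String> splitRacks) {
    LocalityLevel best = LocalityLevel.CROSS_RACK;
    for (int i = 0; i < splitHosts.size(); i++) {
      if (splitHosts.get(i).equals(trackerHost)) {
        return LocalityLevel.SAME_NODE;   // cannot get any closer
      }
      if (splitRacks.get(i).equals(trackerRack)) {
        best = LocalityLevel.SAME_RACK;
      }
    }
    return best;
  }
}
{code}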

In detail, with pointers to the code:
SimulatorTaskTracker.java:738 sets the run time of the task in the
simulation; the relevant
SimulatorTaskTracker.SimulatorTaskInProgress.userSpaceRunTime comes from
org.apache.hadoop.tools.rumen.TaskAttemptInfo,
org.apache.hadoop.tools.rumen.MapTaskAttemptInfo, and
org.apache.hadoop.tools.rumen.ReduceTaskAttemptInfo

These are set at SimulatorJobTracker.java:443
using SimulatorJobInProgress.getTaskAttemptInfo(taskTracker, taskAttemptID)
which uses SimulatorJobStory, which is just a wrapper around the
org.apache.hadoop.tools.rumen.JobStory interface.

The latter is actually implemented by org.apache.hadoop.tools.rumen.ZombieJob, 
which combines the logged run time of the task attempts with simple heuristics 
based on the locality of the taskTracker relative to the input splits to make 
up a run time for the task.

If you want to alter the effect of task placement on run time, we suggest 
modifying the ZombieJob class, along the lines sketched below.
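
As a rough illustration of the kind of change we have in mind (hypothetical 
names and made-up constants, not the actual ZombieJob code): the simulated run 
time can be derived from the logged run time, adjusted by how the locality 
achieved in the simulation compares to the locality recorded in the log. The 
scaling table is the natural knob to turn.

{code:java}
/**
 * Hypothetical sketch of a locality-sensitive run time heuristic, in the
 * spirit of what ZombieJob does; the real class works with rumen
 * TaskAttemptInfo objects and its own constants. Reuses the LocalityLevel
 * enum from the earlier sketch.
 */
class SimulatedRuntimeModel {
  // Assumed relative cost of reading the map input at each locality level,
  // normalized to a node-local read. These numbers are made up for
  // illustration; changing them changes the effect of task placement.
  private static final double[] LOCALITY_FACTOR = {
      1.0,   // SAME_NODE
      1.15,  // SAME_RACK
      1.5    // CROSS_RACK
  };

  /**
   * @param loggedRuntimeMs   run time of the attempt as recorded in the job log
   * @param loggedLocality    locality the attempt had in the original run
   * @param simulatedLocality locality assigned by the simulated scheduler
   * @return run time to use for the simulated attempt, in milliseconds
   */
  long mapRuntime(long loggedRuntimeMs,
                  LocalityLevel loggedLocality,
                  LocalityLevel simulatedLocality) {
    double adjust = LOCALITY_FACTOR[simulatedLocality.ordinal()]
                  / LOCALITY_FACTOR[loggedLocality.ordinal()];
    return Math.round(loggedRuntimeMs * adjust);
  }
}
{code}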

Please let us know if we understood your question correctly.

Best,
  Tamas and Anirban

                
> Mumak: Map-Reduce Simulator
> ---------------------------
>
>                 Key: MAPREDUCE-728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-728
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.21.0
>            Reporter: Arun C Murthy
>            Assignee: Hong Tang
>             Fix For: 0.21.0
>
>         Attachments: 19-jobs.topology.json.gz, 19-jobs.trace.json.gz, 
> mapreduce-728-20090917-3.patch, mapreduce-728-20090917-4.patch, 
> mapreduce-728-20090917.patch, mapreduce-728-20090918-2.patch, 
> mapreduce-728-20090918-3.patch, mapreduce-728-20090918-5.patch, 
> mapreduce-728-20090918-6.patch, mapreduce-728-20090918.patch, mumak.png
>
>
> h3. Vision:
> We want to build a Simulator to simulate large-scale Hadoop clusters, 
> applications and workloads. This would be invaluable in furthering Hadoop by 
> providing a tool for researchers and developers to prototype features (e.g. 
> pluggable block-placement for HDFS, Map-Reduce schedulers etc.) and predict 
> their behaviour and performance with a reasonable amount of confidence, 
> thereby aiding rapid innovation.
> ----
> h3. First Cut: Simulator for the Map-Reduce Scheduler
> The Map-Reduce Scheduler is a fertile area of interest with at least four 
> schedulers, each with their own set of features, currently in existence: 
> Default Scheduler, Capacity Scheduler, Fairshare Scheduler & Priority 
> Scheduler.
> Each scheduler's scheduling decisions are driven by many factors, such as 
> fairness, capacity guarantee, resource availability, data-locality etc.
> Given that, it is non-trivial to accurately choose a single scheduler, or 
> even a set of desired features, i.e. to predict the right scheduler (or 
> features) for a given workload. Hence a simulator which can predict how well 
> a particular scheduler works for some specific workload by quickly iterating 
> over schedulers and/or scheduler features would be quite useful.
> So, the first cut is to implement a simulator for the Map-Reduce scheduler 
> which takes as input a job trace derived from a production workload and a 
> cluster definition, and simulates the execution of the jobs defined in the 
> trace on this virtual cluster. As output, the detailed job execution trace 
> (recorded in relation to virtual simulated time) could then be analyzed to 
> understand various traits of individual schedulers (individual jobs' 
> turnaround time, throughput, fairness, capacity guarantee, etc.). To support 
> this, we would need a simulator which could accurately model the conditions 
> of the actual system which would affect a scheduler's decisions. These 
> include very large-scale clusters (thousands of nodes), the detailed 
> characteristics of the workload thrown at the clusters, job or task 
> failures, data locality, and cluster hardware (CPU, memory, disk I/O, 
> network I/O, network topology), etc.
