[ 
https://issues.apache.org/jira/browse/MAPREDUCE-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728412#action_12728412
 ] 

Matei Zaharia commented on MAPREDUCE-728:
-----------------------------------------

This is looking good!

I have one item of high-level feedback. It looks like Mumak has two components 
- a simulator and a trace-driven workload generator. It would be nice if the 
workload generator were pluggable so that the simulator could be used on 
synthetic workloads without requiring a trace. For example, one should be able 
to create a simulated cluster where some given node is always slow, or fails 
partway through, etc. Then the simulator could be used in unit tests, 
simplifying a lot of the testing code in various schedulers.
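
To make the pluggable-generator idea concrete, here is a minimal sketch in Java. Every name in it (SimJob, WorkloadGenerator, UniformWorkload, WorkloadDemo) is hypothetical and made up for illustration; none of them come from the Mumak patch or from Hadoop. The assumption is that the simulator only needs to pull jobs in submission order, so a trace reader and a synthetic source could sit behind the same interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical description of one simulated job; not a Mumak or Hadoop type. */
final class SimJob {
    final String id;
    final long submitTimeMs;
    final int numMaps;
    final int numReduces;

    SimJob(String id, long submitTimeMs, int numMaps, int numReduces) {
        this.id = id;
        this.submitTimeMs = submitTimeMs;
        this.numMaps = numMaps;
        this.numReduces = numReduces;
    }
}

/** Hypothetical plug-in point: the simulator pulls jobs without knowing their origin. */
interface WorkloadGenerator {
    /** Returns the next job in submission order, or null when the workload is exhausted. */
    SimJob nextJob();
}

/** Synthetic source: a fixed number of identical jobs submitted at a fixed interval. */
final class UniformWorkload implements WorkloadGenerator {
    private final int totalJobs;
    private final long intervalMs;
    private final int mapsPerJob;
    private final int reducesPerJob;
    private int emitted = 0;

    UniformWorkload(int totalJobs, long intervalMs, int mapsPerJob, int reducesPerJob) {
        this.totalJobs = totalJobs;
        this.intervalMs = intervalMs;
        this.mapsPerJob = mapsPerJob;
        this.reducesPerJob = reducesPerJob;
    }

    @Override
    public SimJob nextJob() {
        if (emitted >= totalJobs) {
            return null;
        }
        SimJob job = new SimJob("synthetic_" + emitted, emitted * intervalMs,
                                mapsPerJob, reducesPerJob);
        emitted++;
        return job;
    }
}

final class WorkloadDemo {
    /** Drains a generator into a list, the way a simulator's submission loop might. */
    static List<SimJob> drain(WorkloadGenerator generator) {
        List<SimJob> jobs = new ArrayList<>();
        for (SimJob job = generator.nextJob(); job != null; job = generator.nextJob()) {
            jobs.add(job);
        }
        return jobs;
    }

    public static void main(String[] args) {
        for (SimJob job : drain(new UniformWorkload(3, 1000L, 10, 2))) {
            System.out.println(job.id + " submitted at t=" + job.submitTimeMs + " ms");
        }
    }
}
```

A trace-driven generator would implement the same interface by reading jobs from the trace file, and a unit test for a scheduler could pass in a small synthetic generator like the one above.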

Also, some questions about things that will be difficult to simulate:
* What will be done about speculative tasks? The trace currently shows a second 
attempt being started and a first being killed. One option would be to make the 
first attempt take forever, but then you'd have to decide when to mark the task 
as speculatable in the simulated JobInProgress. Another option might be to 
always use the time of the fastest non-killed task attempt and forget about 
simulating speculation in V1.
* Will Mumak simulate high-memory jobs? That's one of the more interesting 
scheduling problems.
* The schedulers and the JobTracker currently have some threads that perform an 
operation periodically and sleep in between. To make these work in a 
simulator, I think we have to make these pieces of code not use threads, and 
include an API in the JobTracker such as schedulePeriodically(Runnable 
runnable, long interval) so that these threads can run in simulated time.
* Calls to System.currentTimeMillis() will have to be replaced by use of Clock 
throughout the schedulers.
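
As a rough illustration of the last two points, the sketch below shows one way periodic work could run in simulated time without threads. It is only a sketch under assumptions: SimulatedClock, its getTime() and runUntil() methods, and ClockDemo are hypothetical names, not the existing Hadoop Clock class or any JobTracker API. Only the schedulePeriodically(Runnable, long) signature follows the suggestion above. Periodic tasks are kept in a priority queue ordered by virtual due time, and the clock jumps from one event to the next instead of sleeping.

```java
import java.util.PriorityQueue;

/** Hypothetical virtual-time event loop; not the Hadoop Clock or JobTracker API. */
final class SimulatedClock {
    /** One pending firing of a periodic task. */
    private static final class Event implements Comparable<Event> {
        final long time;      // virtual time at which the task is due
        final long seq;       // tie-breaker so equal-time events fire in insertion order
        final Runnable task;
        final long interval;

        Event(long time, long seq, Runnable task, long interval) {
            this.time = time;
            this.seq = seq;
            this.task = task;
            this.interval = interval;
        }

        @Override
        public int compareTo(Event other) {
            int byTime = Long.compare(time, other.time);
            return byTime != 0 ? byTime : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();
    private long now = 0;
    private long nextSeq = 0;

    /** Stand-in for System.currentTimeMillis() in simulated code. */
    long getTime() {
        return now;
    }

    /** Runs the task every intervalMs of virtual time, first at now + intervalMs. */
    void schedulePeriodically(Runnable runnable, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        queue.add(new Event(now + intervalMs, nextSeq++, runnable, intervalMs));
    }

    /** Advances virtual time to endMs, firing every event due on the way. */
    void runUntil(long endMs) {
        while (!queue.isEmpty() && queue.peek().time <= endMs) {
            Event event = queue.poll();
            now = event.time;
            event.task.run();
            // Re-arm the task for its next period.
            queue.add(new Event(event.time + event.interval, nextSeq++,
                                event.task, event.interval));
        }
        now = Math.max(now, endMs);
    }
}

final class ClockDemo {
    public static void main(String[] args) {
        SimulatedClock clock = new SimulatedClock();
        clock.schedulePeriodically(
            () -> System.out.println("periodic check at t=" + clock.getTime()), 500L);
        clock.runUntil(2000L); // returns immediately in wall-clock terms
    }
}
```

In a real deployment the same method could be backed by a timer thread and the wall clock, so scheduler code written against it would not need to know which mode it is running in.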

> Mumak: Map-Reduce Simulator
> ---------------------------
>
>                 Key: MAPREDUCE-728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-728
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>
>         Attachments: mumak.png
>
>
> h3. Vision:
> We want to build a Simulator to simulate large-scale Hadoop clusters, 
> applications and workloads. This would be invaluable in furthering Hadoop by 
> providing a tool for researchers and developers to prototype features (e.g. 
> pluggable block-placement for HDFS, Map-Reduce schedulers etc.) and predict 
> their behaviour and performance with a reasonable amount of confidence, 
> thereby aiding rapid innovation.
> ----
> h3. First Cut: Simulator for the Map-Reduce Scheduler
> The Map-Reduce Scheduler is a fertile area of interest with at least four 
> schedulers, each with their own set of features, currently in existence: 
> Default Scheduler, Capacity Scheduler, Fairshare Scheduler & Priority 
> Scheduler.
> Each scheduler's scheduling decisions are driven by many factors, such as 
> fairness, capacity guarantee, resource availability, data-locality etc.
> Given that, it is non-trivial to accurately choose a single scheduler or even 
> a set of desired features to predict the right scheduler (or features) for a 
> given workload. Hence a simulator which can predict how well a particular 
> scheduler works for some specific workload by quickly iterating over 
> schedulers and/or scheduler features would be quite useful.
> So, the first cut is to implement a simulator for the Map-Reduce scheduler 
> which takes as input a job trace derived from a production workload and a 
> cluster definition, and simulates the execution of the jobs as defined in 
> the trace in this virtual cluster. As output, the detailed job execution 
> trace (recorded in relation to virtual simulated time) could then be analyzed 
> to understand various traits of individual schedulers (individual jobs' 
> turnaround time, throughput, fairness, capacity guarantee, etc.). To support 
> this, we would need a simulator which could accurately model the conditions 
> of the actual system which would affect a scheduler's decisions. These include 
> very large-scale clusters (thousands of nodes), the detailed characteristics 
> of the workload thrown at the clusters, job or task failures, data locality, 
> and cluster hardware (cpu, memory, disk i/o, network i/o, network topology) 
> etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
