[ https://issues.apache.org/jira/browse/MAPREDUCE-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733437#action_12733437 ]
Chris Douglas commented on MAPREDUCE-776:
-----------------------------------------

*Version 1*

The following is a list of planned features that distinguish Gridmix from its predecessors:
* Workload is generated from the job-history trace analyzer Rumen ( MAPREDUCE-751 ).
* Model the details of I/O workload:
** Much larger working set.
** Input and output data volumes for every task.
** Input and output record counts for every task.
* Model the details of memory resource usage.
* Model the job submission rates.

The main purpose of Gridmix is to evaluate MapReduce and HDFS performance. It will not, and is not intended to, capture improvements to layers on top of MapReduce. Like its predecessors, Gridmix will be principally a job submission client (though collecting high-level metrics- as was attempted in Gridmix2- would be a natural extension), and it will be developed in several stages. A usable and useful V1.0 of the submitting client must satisfy the requirements in each functional category listed below.

*Simplifying Assumptions*

The context for the following list is unpacked in subsequent sections. Briefly, these are the job properties that will not be closely reproduced:
* _CPU usage._ We have no data for per-task CPU usage, so we cannot attempt even an approximation (e.g. tasks should never be CPU bound, though this surely happens in practice).
* _Filesystem properties._ We will make no attempt to match block sizes, namespace hierarchies, or any property of input, intermediate, or output data other than the bytes/records consumed and emitted by a given task.
* _I/O rates._ Related to CPU usage, the rate at which a given task consumes/emits records is assumed to be 1) limited only by the speed of the reader/writer and 2) constant throughout the job.
* _Memory profile._ No information on tasks' memory use is available other than the requested heap size. Given data on the frequency and size of allocations, a more thorough profile could be compiled and reproduced.
* _Skew._ We assume records read and output by a task follow the observed averages. This is optimistic, as fewer buffers will require resizing, etc. Further, we assume that each map generates a proportional percentage of each reduce's input.
* _Job failure._ We assume that framework and hardware faults are the only causes of task failure, i.e. user code is correct.
* _Job independence._ The output or outcome of one job does not affect when or whether another job will run.

*Components*

* *Data Generation.* When run on a cluster without sufficient data- a common case when testing new versions of Hadoop- the submitting client should have a data generation phase to populate the cluster with a configurable volume of data.
** _Reasonable distribution._ We do not yet have sufficient data to generate a simulated block map. In the future, we may discover that this requires some processing of the Rumen trace to write data similar to that of the actual jobs. For now, we require only that the input data be distributed more-or-less evenly across the cluster, to avoid artificial hot-spots in the Gridmix data.
** _Content._ Since the map will employ a trivial binary reader, there are no constraints on the generated data. Future versions of Gridmix could produce compressible or compressed data- or even mock data to be consumed by a non-trivial reader- but the only requirement we impose in V1.0 is that the generated data be random, i.e. not trivially compressible (a sketch of such a generator follows this list).
** _Namespace._ The generated data will make no attempt to emulate the block size, file hierarchy, or any other property of the data over which the user jobs originally ran. The generated data will be written into a flat namespace.
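To make the _Content_ requirement concrete, here is a minimal sketch of a writer that fills a flat namespace with random, incompressible bytes. The class name, paths, and sizes are hypothetical illustrations- not the planned Gridmix code- and in the real client one such writer would run per TT node, inside a map task:

{code:java}
// Illustrative sketch only: write a fixed volume of random (hence not
// trivially compressible) bytes into a flat output directory. Class name,
// file name, and sizes are hypothetical.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomDataGen {
  public static void main(String[] args) throws IOException {
    final long bytesPerFile = 1L << 30;      // e.g. 1 GB per generated file
    final Path dir = new Path(args[0]);      // flat namespace: no hierarchy
    final FileSystem fs = FileSystem.get(new Configuration());
    final Random r = new Random();
    final byte[] buf = new byte[64 * 1024];
    // In the real client, this loop would run in one map task per TT node.
    final FSDataOutputStream out = fs.create(new Path(dir, "part-00000"));
    try {
      for (long written = 0; written < bytesPerFile; written += buf.length) {
        r.nextBytes(buf);                    // random content, no structure
        out.write(buf);
      }
    } finally {
      out.close();
    }
  }
}
{code}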
* *Generic MR job.* Gridmix must have a generic job to effect the workloads in the trace it receives from Rumen. For the first version, our focus will be on reproducing I/O traffic and memory usage.
** _Task configuration._ The number of maps and reduces must match the Rumen trace description.
** _Memory usage._ Lacking explicit data on actual memory usage in task JVMs, we assume that each task will consume and use most of the heap allocated to it. This is a pessimistic estimate- it is expected that most users accept the default setting and use considerably less- but it will be predictable.
** _Map fidelity._ Each map will read the number of records and bytes specified in the Rumen trace. It is assumed that the map will emit its records at an even rate, i.e. a fixed number of output records per input record, chosen to effect the correct total number of output records. Without data on the record read rate (and CPU usage), we cannot simulate the effort expended per record, but the map itself will do some nominal work. We do not control block size, so if the data read in the original user job had a different block size, the percentage of remote and local reads will likely be affected.
** _Shuffle fidelity._ We do not have sufficient data to describe skewed task output, e.g. a job with more than one reduce in which at least one reduce receives most of its input from a single map task. Instead, we will assume that each map generates the same percentage of bytes and records as that received by a given reduce. For example, if reduce r receives i% of the input records and j% of the bytes in the shuffle, then each map m will direct i% of its output records, comprising j% of its output bytes, to reduce r (see the sketch following this component). This will, with some error, match the input record and byte counts at each reduce. Note that combiners in user jobs will affect the byte and record counts we consider, but no combiner will be run. Whether Rumen reports the shuffled bytes/records as the map output- or the bytes output from the map prior to the combiner- will not be distinguished by the Gridmix submission client.
** _Reduce fidelity._ Each reduce will- by definition- receive the set of bytes and records assigned to it from each map. As in the map, it will output at a constant rate relative to the input records, such that it emits the number of records and bytes described in the Rumen trace.
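The proportional-output assumption under _Shuffle fidelity_ reduces to a simple per-map quota computation. The following is an illustrative sketch under that assumption- the class and method names are hypothetical, not part of the proposed job:

{code:java}
// Illustrative sketch of the proportional-shuffle assumption. Given the
// fraction of all shuffle records each reduce receives (from the Rumen
// trace), every map directs that same fraction of its own output to that
// reduce, so per-reduce totals match the trace modulo rounding. The same
// computation, applied to the per-reduce byte shares, yields byte quotas.
public class ShuffleQuota {
  /** Records this map should direct to each reduce. */
  static long[] recordQuota(long mapOutputRecords, double[] reduceShare) {
    final long[] quota = new long[reduceShare.length];
    for (int r = 0; r < reduceShare.length; ++r) {
      // reduce r gets reduceShare[r] of every map's output records
      quota[r] = Math.round(mapOutputRecords * reduceShare[r]);
    }
    return quota;
  }
}
{code}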
* *Driver Program.*
** _Input parameters._ The Gridmix client will accept a Rumen trace of jobs (either from a URI or stdin), a work directory, and (optionally) the size of the data to be generated. One data-generating task will be started per TT node; each will generate the same amount of data.
** _Submission._ Given that Gridmix is simulating a set of submitting clients, it must be possible to submit jobs concurrently in case split generation would otherwise cause a deadline to be missed. (Un)Fortunately, scalability limitations at the JobTracker impose an upper bound on the submission rate the client can effect. Errors in submission must be reported at the client.
** _Cancellation and shutdown._ It should be possible to cancel a run of Gridmix at any time without leaving jobs running on the cluster. Gridmix must make a best effort to kill any running jobs it has started before exiting, via a shutdown hook (a sketch of the submission loop and shutdown hook follows the Future Work list).
** _Monitoring._ Though V1.0 will not attempt to monitor job dependencies, and will merely submit each job at its appointed deadline, Gridmix must monitor the jobs it has submitted and perform success/failure callbacks for every completed job. For V1.0, it is assumed that Rumen will provide all necessary intelligence for processing the run at the JobTracker.

*Future Work*

There are a number of ideas and approaches we elect to defer until either the need is demonstrated or more time is available for implementation. Among them:
* _Job throttling._ Rather than attempt any sort of backoff at the client in response to JobTracker load, we take the position that a trace is an inviolate submission profile. This allows us to make comparisons between two runs of the same trace. However, if this is to replace GridMix and GridMix2 as the de facto load generator, or if one wants to throttle submission on evidence that it interferes with some unrelated measurement, such throttling may become attractive.
* _Compression._ The compressibility of input, intermediate, and output data can be mined from task logs. It would be possible to generate data that roughly matches the observed "compressibility" of user jobs, but this would require a targeted effort.
* _Designed jobs._ Given that MapReduce has a range of widely used libraries- whose development is a considerable portion of JIRA traffic- tracking improvements to this core functionality in proportion to their prevalence in user jobs would aid in identifying bottlenecks in the application layer. Support for specifying InputFormats, OutputFormats, and even commonly used Mappers and Reducers would augment the synthetic mix.
* _Failure._ Gridmix assumes that failures are caused by hardware or flaws in the framework, so it makes no attempt to inject failures into a particular run. Since this does not match some of the preliminary insights into the causes of task failure, it may be argued that the Rumen trace should include- and Gridmix should honor- application failures. Tracing where in the input the failure occurs- and which component fails- would be a way to characterize and measure approaches to mitigating resource waste. In particular, one may improve the proportion of "useful" work- and thus throughput- with a better strategy for punishing/throttling jobs that will ultimately fail.
* _Job dependencies._ Modeling dependencies between jobs would remove a lower bound on job throughput by starting dependent jobs immediately once their prerequisites have been met. In general, one can imagine a submission client that accepts a list of prerequisites, of which a submission time is merely the most common.
* _Isolation._ We assume that Gridmix "owns" the cluster it's running on, i.e. the only jobs running on the cluster are those submitted by an instance of the client. Supporting load generation alongside user tasks would be another use for throttling job submission (and a possible alternative to very fine simulations of specific user tasks).
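For concreteness, here is a rough sketch of the deadline-driven submission loop and best-effort shutdown hook described under _Submission_ and _Cancellation and shutdown_ above. Trace parsing and JobConf construction are elided; the class is a hypothetical illustration built on the public JobClient API, not the actual driver:

{code:java}
// Illustrative sketch: submit each job at its appointed deadline from a
// thread pool (so slow split generation cannot delay later submissions),
// track running jobs, and make a best effort to kill them on shutdown.
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmissionDriver {
  private final ScheduledExecutorService sched =
      Executors.newScheduledThreadPool(8);   // concurrent submission threads
  private final ConcurrentLinkedQueue<RunningJob> running =
      new ConcurrentLinkedQueue<RunningJob>();

  void run(List<JobConf> jobs, List<Long> deadlinesMs) throws Exception {
    final JobClient client = new JobClient(new JobConf());
    // Best effort: kill anything still running if the client exits.
    Runtime.getRuntime().addShutdownHook(new Thread() {
      public void run() {
        for (RunningJob j : running) {
          try { j.killJob(); } catch (Exception ignored) { }
        }
      }
    });
    for (int i = 0; i < jobs.size(); ++i) {
      final JobConf job = jobs.get(i);
      sched.schedule(new Runnable() {
        public void run() {
          try {
            running.add(client.submitJob(job));
          } catch (Exception e) {
            // submission errors are reported at the client
            System.err.println("Submission failed: " + e);
          }
        }
      }, deadlinesMs.get(i), TimeUnit.MILLISECONDS);
    }
  }
}
{code}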
> Gridmix: Trace-based benchmark for Map/Reduce
> ---------------------------------------------
>
>                 Key: MAPREDUCE-776
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-776
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>
> Previous benchmarks ( HADOOP-2369 , HADOOP-3770 ), while informed by
> production jobs, were principally load generating tools used to validate
> stability and performance under saturation. The important dimensions of that
> load- submission order/rate, I/O profile, CPU usage, etc- only accidentally
> match that of the real load on the cluster. Given related work that
> characterizes production load ( MAPREDUCE-751 ), it would be worthwhile to
> use mined data to impose a corresponding load for tuning and guiding
> development of the framework.
> The first version will focus on modeling task I/O, submission, and memory
> usage.