[jira] Updated: (MAPREDUCE-1295) We need a job trace manipulator to build gridmix runs.

Dick King (JIRA) Tue, 15 Dec 2009 10:05:41 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dick King updated MAPREDUCE-1295:
---------------------------------

    Status: Patch Available  (was: Open)

This patch implements the folder, the trace manipulation tool described in the issue below.

The command line looks like this:

{noformat}
   Folder [ -output-duration duration ]           \
          [ -input-cycle duration     ]           \
          [ -concentration ratio      ]           \
          [ -seed seed                ]           \
          [ -temp temp-directory      ]           \
          [ -debug                    ]           \
          [ -skew-buffer-length n     ]           \
          [ -allow-missorting         ]           \
       input-path output-path
{noformat}

All paths and directories are general {{Path}}s.

{{-output-duration}} is the length of the range of the submit times of the 
output trace.

{{-input-cycle}} is the length of the input job submit time cycle.  For 
example, if {{-input-cycle}} is 2 hours, then 3PM and 1PM are treated alike.
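
One way to read "treated alike" is that submit times are reduced modulo the cycle length. The sketch below illustrates that reading only; the function name {{cycle_offset}} and the use of seconds since midnight are assumptions made for the example, not code from the patch.

```python
def cycle_offset(submit_time_s, input_cycle_s):
    """Return the position of a submit time within its input cycle.

    Hypothetical helper: assumes folding means 'submit time modulo cycle'.
    """
    if input_cycle_s <= 0:
        raise ValueError("input cycle must be positive")
    return submit_time_s % input_cycle_s


TWO_HOURS = 2 * 60 * 60
one_pm = 13 * 60 * 60      # seconds since midnight
three_pm = 15 * 60 * 60

# With a 2-hour cycle, 1PM and 3PM land on the same offset.
same = cycle_offset(one_pm, TWO_HOURS) == cycle_offset(three_pm, TWO_HOURS)
print(same)
```

Under this reading, any two submit times that differ by a whole number of cycles are indistinguishable to the folder.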

{{-concentration}} is a double, a ratio of the density [number of jobs starting 
per hour] in the output over the input.  This can be less than or greater than 
1.0.
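
As a rough illustration of how a fractional ratio could be applied by duplicating or culling jobs, the sketch below emits each job {{floor(ratio)}} times plus one extra copy with probability equal to the fractional part. This scheme and the name {{apply_concentration}} are assumptions for the example; the patch may choose jobs differently.

```python
import random


def apply_concentration(jobs, concentration, rng):
    """Duplicate or cull jobs so the expected output density is
    `concentration` times the input density (hypothetical scheme)."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    whole = int(concentration)           # guaranteed copies of every job
    fraction = concentration - whole     # chance of one additional copy
    output = []
    for job in jobs:
        copies = whole + (1 if rng.random() < fraction else 0)
        output.extend([job] * copies)
    return output


jobs = ["job_%d" % i for i in range(10)]
doubled = apply_concentration(jobs, 2.0, random.Random(1))
thinned = apply_concentration(jobs, 0.5, random.Random(1))
print(len(doubled), len(thinned))
```

A ratio of 2.0 duplicates every job, while a ratio below 1.0 keeps each job only with that probability, so the exact count then depends on the seed.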

{{-seed}} is a random number generator seed, used to create repeatable runs if 
desired.  If no {{-seed}} is provided, the folder reports the seed it used, so 
you can pass that value as {{-seed}} if you want to repeat this run.
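
The general pattern of "pick a seed when none is given, and report it" can be sketched as follows. The helper name {{make_rng}} and the message wording are hypothetical; only the behavior described above is taken from this comment.

```python
import random


def make_rng(seed=None):
    """Return (seed, generator). If no seed is supplied, choose one
    and report it so the run can be repeated (hypothetical helper)."""
    if seed is None:
        seed = random.SystemRandom().getrandbits(63)
        print("no -seed given; use -seed %d to repeat this run" % seed)
    return seed, random.Random(seed)


# A run without a seed can be replayed with the reported seed.
seed, first = make_rng()
_, replay = make_rng(seed)
print(first.random() == replay.random())
```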

{{-temp}} is a temp directory, which must be able to hold about as much data as 
the input contains, compressed.  Trace data compresses at about 17:1.  
{{-temp}} defaults to the directory of the output.  The temporary files are 
erased whether the run succeeds or fails, unless {{-debug}} is coded.

{{-debug}} induces the tool to produce a lot of debugging output, and causes 
the intermediate files to be retained after a run.

{{-skew-buffer-length}} sets the length of the skew buffer, which defaults 
to 0.  The input to the folder should be sorted by submit time, and the job 
tracker log output is approximately sorted.  However, there are occasional 
small glitches in the job tracker logs: some jobs come out a few places 
earlier than they would if the log were ordered by submit time.  Code a 
{{-skew-buffer-length}} of {{i > 0}} to allow as many as {{i}} jobs to arrive 
earlier than they're supposed to and be buffered until they can be released.

{{-allow-missorting}} instructs the folder to just drop a job that arrives 
later than it should, if there's not enough room in the skew buffer for it.  If 
{{-allow-missorting}} is not coded, the run aborts instead.
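
The interaction between the skew buffer and {{-allow-missorting}} can be pictured with a bounded min-heap. This is a minimal sketch under the assumption that the buffer releases its earliest job once it holds more than the configured number of entries; the names {{reorder}} and {{MissortedJobError}} are invented for the example and do not come from the patch.

```python
import heapq


class MissortedJobError(Exception):
    """A job arrived too late for the skew buffer to put it in order."""


def reorder(submit_times, skew_buffer_length=0, allow_missorting=False):
    """Yield submit times in non-decreasing order using a bounded buffer."""
    heap = []
    last_released = None
    for t in submit_times:
        if last_released is not None and t < last_released:
            # Too late: an already-released job has a later submit time.
            if allow_missorting:
                continue  # drop the job and keep going
            raise MissortedJobError("job at %r arrived too late" % (t,))
        heapq.heappush(heap, t)
        if len(heap) > skew_buffer_length:
            last_released = heapq.heappop(heap)
            yield last_released
    while heap:  # flush whatever is still buffered
        yield heapq.heappop(heap)


slightly_out_of_order = [1, 3, 2, 4]
print(list(reorder(slightly_out_of_order, skew_buffer_length=1)))
print(list(reorder(slightly_out_of_order, allow_missorting=True)))
```

In this sketch a buffer of length 1 is enough to repair the example input, while a buffer of length 0 with dropping enabled loses the job with submit time 2.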

If the run is successful, either because {{-allow-missorting}} is coded or 
{{-skew-buffer-length}} is big enough, the folder tells the user what the 
smallest {{-skew-buffer-length}} could have been for the run to succeed without 
omitting any jobs.
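
For intuition about what the "smallest {{-skew-buffer-length}}" means, the following offline sketch computes it for a complete list of submit times. It assumes the buffer releases its earliest job once it holds more than the configured number of entries, in which case the required length is the largest distance any job sits behind its sorted position. The function name is hypothetical, and the folder itself presumably derives the figure while streaming rather than by sorting the whole input.

```python
def min_skew_buffer_length(submit_times):
    """Smallest buffer length that lets every job be released in
    submit-time order (offline illustration, not the tool's method)."""
    # Stable sort of arrival positions by submit time.
    order = sorted(range(len(submit_times)), key=lambda i: submit_times[i])
    needed = 0
    for sorted_rank, arrival_index in enumerate(order):
        # A job that arrives after its sorted position needs that many
        # later-sorting jobs to be held back while waiting for it.
        needed = max(needed, arrival_index - sorted_rank)
    return needed


print(min_skew_buffer_length([1, 3, 2, 4]))
print(min_skew_buffer_length([2, 3, 4, 1]))
```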

> We need a job trace manipulator to build gridmix runs.
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-1295
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1295
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: mapreduce-1297--2009-12-14.patch
>
>
> Rumen produces "job traces", which are JSON format files describing important 
> aspects of all jobs that are run [successfully or not] on a hadoop map/reduce 
> cluster.  There are two packages under development that will consume these 
> trace files and produce actions in that cluster or another cluster: gridmix3 
> [see jira MAPREDUCE-1124 ] and Mumak [a simulator -- see MAPREDUCE-728 ].
> It would be useful to be able to do two things with job traces, so we can run 
> experiments using these two tools: change the duration, and change the 
> density.  I would like to provide a "folder", a tool that can wrap a 
> long-duration execution trace to redistribute its jobs over a shorter 
> interval, and also change the density by duplicating or culling away jobs 
> from the folded combined job trace.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
