[ https://issues.apache.org/jira/browse/MAPREDUCE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dick King updated MAPREDUCE-1295:
---------------------------------

    Status: Patch Available  (was: Open)

This is the folder. The command line looks like this:

{noformat}
Folder [ -output-duration duration ] \
       [ -input-cycle duration ] \
       [ -concentration ratio ] \
       [ -seed seed ] \
       [ -temp temp-directory ] \
       [ -debug ] \
       [ -skew-buffer-length n ] \
       [ -allow-missorting ] \
       input-path output-path
{noformat}

All paths and directories are general {{Path}}s.

{{-output-duration}} is the length of the range of the submit times of the output trace.

{{-input-cycle}} is the length of the input job submit time cycle. For example, if {{-input-cycle}} is 2 hours, then 3PM and 1PM are treated alike.

{{-concentration}} is a double, the ratio of the density [number of jobs starting per hour] in the output to the density in the input. This can be less than or greater than 1.0.

{{-seed}} is a random number generator seed, used to create repeatable runs if desired. If no {{-seed}} is provided, we state what the {{-seed}} should be if you want to repeat this run.

{{-temp}} is a temp directory, which must be able to hold about as much data as the input contains, compressed. Trace data compresses at about 17:1. This defaults to the directory of the output. The temporary files are erased whether the job succeeds or fails, unless {{-debug}} is coded.

{{-debug}} induces the tool to produce a lot of debugging output, and causes the intermediate files to be retained after a run.

{{-skew-buffer-length}} describes the length of the skew buffer, which defaults to 0. The input to the folder should be sorted, and the job tracker log output is approximately sorted. However, there are occasional small glitches in the job tracker logs: jobs that come out a few places earlier than they should for the output to be ordered by submit time. Code a {{-skew-buffer-length}} of {{i > 0}} to allow as many as {{i}} jobs to arrive earlier than they're supposed to and be buffered until they can be released.
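
To picture how a skew buffer of this kind can work, here is a rough Python sketch. It is not the code in the patch: the function name, its parameters, and the use of bare submit times in place of full job records are all assumptions made for illustration. It keeps up to {{buffer_length}} entries in a min-heap and releases the earliest one whenever the heap overflows. The hypothetical {{allow_missorting}} argument stands in for the {{-allow-missorting}} flag.

```python
import heapq


def reorder_with_skew_buffer(submit_times, buffer_length=0, allow_missorting=False):
    """Yield submit times in nondecreasing order using a bounded min-heap.

    Illustrative sketch only; names and behavior are assumptions,
    not the implementation attached to this issue.
    """
    heap = []
    last_released = None
    for t in submit_times:
        if last_released is not None and t < last_released:
            # This entry belongs before something already released,
            # so the buffer was too short to put it back in order.
            if allow_missorting:
                continue  # drop the entry and keep going
            raise ValueError(
                "entry %r is out of order; skew buffer too short" % (t,)
            )
        heapq.heappush(heap, t)
        if len(heap) > buffer_length:
            # Buffer is over capacity: release the earliest entry held.
            last_released = heapq.heappop(heap)
            yield last_released
    # Input exhausted: drain whatever is still buffered, earliest first.
    while heap:
        yield heapq.heappop(heap)
```

With {{buffer_length=1}}, the input {{[1, 2, 4, 3, 5]}} comes out fully sorted. With {{buffer_length=0}}, the entry {{3}} either raises an error or, when {{allow_missorting}} is true, is dropped.
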

{{-allow-missorting}} instructs the folder to just drop a job that arrives later than it should, if there's not enough room in the skew buffer for it. If {{-allow-missorting}} is not coded, we abend the run instead.

If the run is successful, either because {{-allow-missorting}} is coded or {{-skew-buffer-length}} is big enough, the folder tells the user what the smallest {{-skew-buffer-length}} could have been for the run to succeed without omitting any jobs.

> We need a job trace manipulator to build gridmix runs.
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-1295
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1295
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: mapreduce-1297--2009-12-14.patch
>
>
> Rumen produces "job traces", which are JSON format files describing important
> aspects of all jobs that are run [successfully or not] on a hadoop map/reduce
> cluster. There are two packages under development that will consume these
> trace files and produce actions in that cluster or another cluster: gridmix3
> [see jira MAPREDUCE-1124 ] and Mumak [a simulator -- see MAPREDUCE-728 ].
> It would be useful to be able to do two things with job traces, so we can run
> experiments using these two tools: change the duration, and change the
> density. I would like to provide a "folder", a tool that can wrap a
> long-duration execution trace to redistribute its jobs over a shorter
> interval, and also change the density by duplicating or culling away jobs
> from the folded combined job trace.