On Sep 15, 2008, at 11:24 AM, Kayla Jay wrote:
> How does one do a check or guarantee there's enough disk space when
> running a hadoop job that you're not sure how much it will produce
> in its results (temp files, etc)?
In 0.19 there is new code that waits until the first N% of maps are
run and estimates the amount of space required for each of the
following tasks. You can see the discussion here:
https://issues.apache.org/jira/browse/HADOOP-657
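
To make the estimation idea concrete, here is a minimal sketch in Python.
It is not the code from HADOOP-657; it only assumes that output size
scales roughly linearly with input size, and every name in it
(estimate_output_bytes, min_fraction, and so on) is hypothetical.

```python
from typing import List, Optional, Tuple


def estimate_output_bytes(
    completed: List[Tuple[int, int]],  # (input_bytes, output_bytes) per finished map
    total_maps: int,
    next_input_bytes: int,
    min_fraction: float = 0.1,  # hypothetical stand-in for "the first N%"
) -> Optional[int]:
    """Estimate the output size of the next task, or return None when
    too few maps have finished to make a guess."""
    if total_maps <= 0 or len(completed) < min_fraction * total_maps:
        return None
    total_in = sum(i for i, _ in completed)
    total_out = sum(o for _, o in completed)
    if total_in == 0:
        return None
    # Scale the observed output/input ratio to the next task's input.
    return int(next_input_bytes * total_out / total_in)


# Two of ten maps have finished, each writing twice what it read.
done = [(100, 200), (300, 600)]
estimate = estimate_output_bytes(done, total_maps=10, next_input_bytes=500)
print(estimate)  # 1000
```

A scheduler could compare such an estimate against the free space on a
node before placing the next task there.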
You can also set the mapred.local.dir.minspacestart property on the
task trackers. It controls the minimum amount of local disk space that
must be free before a task tracker will ask for a new task.
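
The check behind that property amounts to comparing free space against
a threshold. A rough Python sketch of the idea follows; it is not the
task tracker's actual code, the function name can_accept_task is made
up, and it assumes the threshold is given in bytes.

```python
import shutil


def can_accept_task(local_dir: str, min_space_start: int) -> bool:
    """Return True when local_dir has at least min_space_start bytes free.

    min_space_start plays the role of mapred.local.dir.minspacestart
    (assumed here to be a byte count).
    """
    free_bytes = shutil.disk_usage(local_dir).free
    return free_bytes >= min_space_start


# Ask for new work only if at least 1 GiB is free in the current directory.
print(can_accept_task(".", 1 * 1024**3))
```

With a threshold of 0 the check always passes, so the node keeps asking
for tasks regardless of free space.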
> Or, what if you run out of disk space on HDFS when you are running
> large jobs with large outputs? The job just fails, but how can
> one assess this resource allocation of disk space while running
> your jobs?
Map/Reduce works by re-executing tasks that fail, including tasks that
fail for lack of disk space. If a task fails, its partial results
are erased on the assumption that the task will be re-run later. Tasks
that finish will have their output in the output directory, even if
the job fails.
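
The retry-and-clean-up behaviour can be illustrated with a small Python
sketch. This is only a model of the idea, not how Hadoop implements it:
run_with_retries, max_attempts, and the per-attempt work directory are
all hypothetical, and the sketch treats any OSError (such as "no space
left on device") as a task failure.

```python
import os
import shutil
import tempfile
from typing import Callable


def run_with_retries(
    task: Callable[[str], None], output_dir: str, max_attempts: int = 4
) -> bool:
    """Run task(work_dir) until it succeeds or attempts run out.

    Each attempt writes into its own scratch directory. A failed
    attempt's partial output is erased; a successful attempt's
    directory is renamed to output_dir (which must not exist yet).
    """
    parent = os.path.dirname(os.path.abspath(output_dir))
    for _ in range(max_attempts):
        work_dir = tempfile.mkdtemp(prefix="attempt_", dir=parent)
        try:
            task(work_dir)
        except OSError:
            # Discard partial results; the task will be tried again.
            shutil.rmtree(work_dir, ignore_errors=True)
            continue
        os.rename(work_dir, output_dir)
        return True
    return False
```

Only output from an attempt that ran to completion ever appears under
output_dir, which mirrors the guarantee described above.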
-- Owen