How can you check, or guarantee, that there is enough disk space when running
a Hadoop job whose output size (results, temp files, etc.) you can't predict?
That is, when you run a Hadoop job and you're not sure how much disk space it
will consume (including temp dirs), the job fails if it runs out.
How do you make sure, while your job is running, that there's enough disk space
on the nodes, and kick off cleanup (so the job won't fail) if disk space is
running low?
For example, if your maps are failing because there isn't enough temporary disk
space on your nodes, how can you fix that up front, before running the job, or
better yet while the job is running, so that the job doesn't fail? The outputs
of maps are stored on the local disk of the nodes on which they were executed.
If your nodes don't have enough local disk space while jobs are running, can I
catch this condition at all, and is there a way to fix it at run time? How do
others solve this when running jobs whose disk consumption isn't known in
advance?
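
To make the kind of check I have in mind concrete, here is a rough sketch (not
a Hadoop feature) of something that could run on each node before or during a
job. It only reports local directories whose filesystem has less than a given
amount of free space. The function name, the 10 GB threshold, and the
directory list are all made up for illustration; the real list would be
whatever the cluster uses for local temp data (for example the directories
configured in mapred.local.dir on classic MapReduce).

```python
import shutil
import tempfile


def dirs_below_threshold(dirs, min_free_bytes):
    """Return (path, free_bytes) for each directory with too little free space."""
    low = []
    for path in dirs:
        # Free bytes on the filesystem that holds this directory.
        free = shutil.disk_usage(path).free
        if free < min_free_bytes:
            low.append((path, free))
    return low


if __name__ == "__main__":
    # Hypothetical directory list: replace with the node's real local temp dirs.
    local_dirs = [tempfile.gettempdir()]
    # Hypothetical threshold: 10 GB.
    for path, free in dirs_below_threshold(local_dirs, 10 * 1024**3):
        print(f"LOW SPACE: {path} has {free} bytes free")
```

Something like this could be run from cron or a monitoring agent, but it only
detects the condition; what cleanup is safe to trigger is still the open
question.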
-----------
Or, what if you run out of disk space on HDFS when running large jobs with
large outputs? The job just fails, but how can you monitor how much HDFS disk
space is left while your jobs are running?
If you are about to run out of HDFS disk space, and you know you want the
results of job X, is there a way to detect this while the job is running and
do some smart cleanup, so as not to lose the data job X could have produced?
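
For the HDFS side, the only thing I can think of is polling from outside the
job, roughly as sketched below. This assumes the `hadoop` command is on the
PATH and that `hadoop fs -df <path>` prints a header row followed by one data
row with the columns Filesystem, Size, Used, Available, Use% (sizes in bytes);
older releases may format this differently, so the parsing would need
adjusting. The function names are made up for illustration.

```python
import subprocess


def parse_df_available(df_output):
    """Return the 'Available' byte count from `hadoop fs -df`-style output.

    Assumes a header row followed by a data row whose columns are
    Filesystem, Size, Used, Available, Use%.
    """
    lines = [ln for ln in df_output.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header row and a data row")
    fields = lines[-1].split()
    if len(fields) < 4:
        raise ValueError("unexpected column layout: %r" % lines[-1])
    return int(fields[3])


def hdfs_available_bytes(path="/"):
    """Run `hadoop fs -df <path>` and return the available bytes it reports."""
    result = subprocess.run(
        ["hadoop", "fs", "-df", path],
        capture_output=True, text=True, check=True,
    )
    return parse_df_available(result.stdout)
```

A watcher could call hdfs_available_bytes() periodically and raise an alert
when the value drops below some threshold, but that still leaves the question
of what can safely be cleaned up without losing job X's output.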