How can one check, or guarantee, that there is enough disk space when running 
a Hadoop job whose output size (results, temp files, etc.) is not known in 
advance?

I.e., when you run a Hadoop job and are not sure how much disk space it will 
consume (including temp dirs), the job will fail if it runs out.

How do you guarantee, while your job is running, that there is enough disk 
space on the nodes, and kick off cleanup (so the job won't fail) if disk space 
is running low?

For example, if your maps are failing because there isn't enough temporary disk 
space on your nodes, how can you fix that up front, before running the job, or 
better yet, while the job is running, so that it doesn't fail? The outputs of 
maps are stored on the local disk of the nodes on which they were executed. If 
your nodes don't have enough local disk space while jobs are running, how can 
you fix this at run time?  Can I catch this condition at all?
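
To make the question concrete, here is a minimal sketch of the kind of up-front 
check I have in mind. It uses only the plain Java File API, not any Hadoop API. 
The class name, the method name and the 10 GB threshold are made up for 
illustration, and I am assuming the directories passed in are the ones the 
nodes use for intermediate map output (e.g. whatever mapred.local.dir points 
to on a given node). It only covers the node it runs on and says nothing about 
how to react during a job.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: flags local directories that
// have less usable space than a caller-supplied minimum.
class LocalDiskCheck {

    // Returns the directories whose usable space is below minFreeBytes.
    // File.getUsableSpace() returns 0 for a path that does not exist, so a
    // missing directory is reported as low whenever minFreeBytes > 0.
    static List<File> dirsBelowThreshold(List<File> dirs, long minFreeBytes) {
        List<File> low = new ArrayList<>();
        for (File dir : dirs) {
            if (dir.getUsableSpace() < minFreeBytes) {
                low.add(dir);
            }
        }
        return low;
    }

    public static void main(String[] args) {
        // Assumption: the local dirs to check are given on the command line.
        List<File> dirs = new ArrayList<>();
        for (String path : args) {
            dirs.add(new File(path));
        }

        // Assumption: 10 GB is an arbitrary example threshold.
        long minFreeBytes = 10L * 1024 * 1024 * 1024;

        for (File dir : dirsBelowThreshold(dirs, minFreeBytes)) {
            System.out.println("Low on space: " + dir + " ("
                    + dir.getUsableSpace() + " bytes usable)");
        }
    }
}
```

Something like this could be run on each node before submitting a job, but it 
is only a guess at what "enough" means, which is really my question.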

Is there a way to fix this at run time?  How do others solve this issue when 
running jobs whose disk consumption isn't known in advance?

-----------
Or, what if you run out of disk space on HDFS when running large jobs with 
large outputs?  The job just fails, but how can one assess how much disk space 
is available and being consumed while the jobs are running?

If you run out of HDFS disk space, and you know you want the results of job X, 
is there a way to detect this while the job is running, so that you can do some 
smart cleanup and not lose the data that job X could have produced?



      
