> 20 hours later, the job had failed.  And I found a few 
> errors in the log:
>     org.apache.hadoop.ipc.RemoteException: 
> java.io.IOException: Cannot open filename ...FILExxx.gz
> 
> Is it possible that the runtime could adjust the input data 
> set accordingly?

If I remember right, the packaged InputFormat/RecordReader implementations in
Hadoop don't do this check, but you could implement your own
InputFormat/RecordReader or subclass an existing RecordReader. Taking
org.apache.hadoop.mapred.LineRecordReader as an example for this discussion,
you could have that open failure ignored in LineRecordReader's constructor;
the next(K,V) method would then have to check whether the input stream is null
and return false on the very first call (apart from possibly some other
things). So, in summary, it is possible.
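
For illustration, here is a rough, untested sketch of that idea against the old
mapred API. The class name SkipMissingTextInputFormat is made up; you would
point your JobConf at it with conf.setInputFormat(SkipMissingTextInputFormat.class):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Sketch: an input format that skips splits whose backing file can no
    // longer be opened, instead of failing the task.
    public class SkipMissingTextInputFormat extends TextInputFormat {

      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        try {
          // Normal case: hand the split to the stock LineRecordReader.
          return new LineRecordReader(job, (FileSplit) split);
        } catch (IOException e) {
          // The file was removed (or can't be opened): return an empty reader
          // so next() reports "no records" on the very first call.
          return new RecordReader<LongWritable, Text>() {
            public boolean next(LongWritable key, Text value) { return false; }
            public LongWritable createKey() { return new LongWritable(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return 0; }
            public float getProgress() { return 1.0f; }
            public void close() { }
          };
        }
      }
    }

Note that silently skipping splits hides data loss, so you would probably also
want to log or count the skipped files somewhere.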

> Then, after many long hours, the job was about to finish with
>     ...INFO mapred.JobClient:  map 100% reduce 100%
> But, it ended up with
>     Exception in thread "main" 
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output 
> directory ...resultdir already exists

This should be handled by OutputFormat.checkOutputSpecs: if you use one of the
OutputFormat implementations shipped with Hadoop that extend
org.apache.hadoop.mapred.OutputFormatBase, that check runs at job submission
and fails the job immediately when the output dir already exists. Which
OutputFormat are you using?
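
If you want to guard against this yourself, a quick check in the job driver
before submitting would also work. A rough sketch (outDir, conf, and args are
just placeholders for whatever your driver actually uses):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Before submitting the job: bail out if the output directory exists.
    Path outDir = new Path(args[1]);        // e.g. "resultdir" from the command line
    FileSystem fs = FileSystem.get(conf);   // conf is your JobConf
    if (fs.exists(outDir)) {
      System.err.println("Output directory " + outDir + " already exists");
      System.exit(1);
    }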

> -----Original Message-----
> From: Nathan Wang [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 25, 2007 11:00 PM
> To: [email protected]
> Subject: A couple of usability problems
> 
> Hi,
> I have a couple of problems that I think the development team 
> could enhance.  
> I'm currently running a job that takes a whole day to finish.
>  
> 1) Adjusting input set dynamically
> At the start, I had 9090 gzipped input data files for the job,
>     07/09/24 10:26:06 INFO mapred.FileInputFormat: Total 
> input paths to process : 9090
> 
> Then I realized there were 3 files that were bad (couldn't be 
> gunzipped).  
> So, I removed them by doing,
>     bin/hadoop  dfs  -rm  srcdir/FILExxx.gz
> 
> 20 hours later, the job had failed.  And I found a few 
> errors in the log:
>     org.apache.hadoop.ipc.RemoteException: 
> java.io.IOException: Cannot open filename ...FILExxx.gz
> 
> Is it possible that the runtime could adjust the input data 
> set accordingly?
> 
> 2) Checking the output directory first
> I started my job with the standard command line,
>     bin/hadoop  jar  myjob.jar  srcdir  resultdir
> 
> Then, after many long hours, the job was about to finish with
>     ...INFO mapred.JobClient:  map 100% reduce 100%
> But, it ended up with
>     Exception in thread "main" 
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output 
> directory ...resultdir already exists
> 
> Can we check the existence of the output directory at the 
> very beginning, to save us a day?
> 
> Thanks,
> Nathan
> 
