Hi,
I have run into a couple of problems that I think the development team could address.
I'm currently running a job that takes a whole day to finish.
 
1) Adjusting input set dynamically
At the start, I had 9090 gzipped input data files for the job,
    07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 9090

Then I realized that 3 of the files were bad (they couldn't be gunzipped),
so I removed them with:
    bin/hadoop  dfs  -rm  srcdir/FILExxx.gz

20 hours later, the job failed, and I found a few errors in the log:
    org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz

Is it possible that the runtime could adjust the input data set accordingly?
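
Until the runtime can do this itself, here is the kind of pre-submission check I have in mind, as a minimal sketch (assuming a Hadoop release with org.apache.hadoop.mapred.FileInputFormat.setInputPaths; the class name and srcDir are placeholders):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class InputProbe {
        // Register only the .gz inputs that can actually be opened and
        // decompressed, so corrupt files are skipped before the job
        // starts rather than failing it 20 hours in.
        public static void setReadableGzInputs(JobConf conf, Path srcDir)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            List<Path> good = new ArrayList<Path>();
            for (FileStatus stat : fs.listStatus(srcDir)) {
                Path p = stat.getPath();
                if (!p.getName().endsWith(".gz")) {
                    continue;
                }
                try {
                    // A corrupt gzip header throws when the stream is
                    // constructed; reading a little also catches early
                    // data corruption.
                    GZIPInputStream in = new GZIPInputStream(fs.open(p));
                    in.read(new byte[4096]);
                    in.close();
                    good.add(p);
                } catch (IOException e) {
                    System.err.println("Skipping unreadable input: " + p);
                }
            }
            FileInputFormat.setInputPaths(conf, good.toArray(new Path[0]));
        }
    }

This only helps before submission, of course; adjusting the split list while the job is running would still need support in the framework itself.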

2) Checking the output directory first
I started my job with the standard command line,
    bin/hadoop  jar  myjob.jar  srcdir  resultdir

Then, after many long hours, the job was about to finish with
    ...INFO mapred.JobClient:  map 100% reduce 100%
But it ended up with:
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ...resultdir already exists

Can we check the existence of the output directory at the very beginning, to 
save us a day?
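
Here is the kind of up-front check I mean, as a minimal client-side sketch (the class name and resultDir are placeholders):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class OutputGuard {
        // Fail fast at submission time instead of after
        // "map 100% reduce 100%".
        public static void checkOutputDir(JobConf conf, Path resultDir)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(resultDir)) {
                throw new IOException("Output directory " + resultDir
                    + " already exists; refusing to start the job.");
            }
        }
    }

Perhaps the OutputFormat's checkOutputSpecs would be the natural place for JobClient to run a check like this before submitting.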

Thanks,
Nathan
