Hi,
I have run into a couple of problems that I think the development team could address.
I'm currently running a job that takes a whole day to finish.
1) Adjusting the input set dynamically
At the start, I had 9090 gzipped input data files for the job,
07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 9090
Then I realized there were 3 files that were bad (couldn't be gunzipped).
So, I removed them by doing,
bin/hadoop dfs -rm srcdir/FILExxx.gz
20 hours later, the job had failed, and I found errors like this in the log:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz
Would it be possible for the runtime to adjust the input set when files are removed from it after the job has been submitted?
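In the meantime, I'm thinking of pre-validating the inputs myself before submitting, along these lines. This is only an untested sketch: the class name GzipInputCheck is mine, args[0] stands for srcdir, and the exact FileSystem listing call may differ depending on the Hadoop version.

    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GzipInputCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path(args[0]);   // e.g. "srcdir"
        byte[] buf = new byte[64 * 1024];
        // Try to fully decompress every .gz input up front, so bad files
        // are found before the job is submitted rather than 20 hours in.
        for (FileStatus stat : fs.listStatus(srcDir)) {
          if (!stat.getPath().getName().endsWith(".gz")) continue;
          try {
            InputStream in = new GZIPInputStream(fs.open(stat.getPath()));
            while (in.read(buf) != -1) { /* decompress only to validate */ }
            in.close();
          } catch (Exception e) {
            System.err.println("Bad input file: " + stat.getPath());
          }
        }
      }
    }

That still costs one extra pass over the data, though, which is why a runtime check would be nicer.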
2) Checking the output directory first
I started my job with the standard command line,
bin/hadoop jar myjob.jar srcdir resultdir
Then, after many long hours, the job was about to finish with
...INFO mapred.JobClient: map 100% reduce 100%
But it ended with
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ...resultdir already exists
Could the existence of the output directory be checked at the very beginning of the job, to save us a day?
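As a stopgap I can do the check on the client side before submitting, roughly like this. Again just a sketch: OutputDirCheck is a name I made up, and args[1] stands for the resultdir argument from my command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputDirCheck {
      // True if it is safe to submit, i.e. the output directory does not exist yet.
      static boolean outputIsFree(String dir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return !fs.exists(new Path(dir));
      }

      public static void main(String[] args) throws Exception {
        if (!outputIsFree(args[1])) {   // args[1] == resultdir
          System.err.println("Output directory " + args[1] + " already exists, aborting.");
          System.exit(1);
        }
        // ...otherwise build the JobConf and call JobClient.runJob() as before
      }
    }

But it would be much better if the framework refused the job at submission time rather than after map/reduce have finished.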
Thanks,
Nathan