On Sep 25, 2007, at 10:30 AM, Nathan Wang wrote:
1) Adjusting input set dynamically
At the start, I had 9090 gzipped input data files for the job,
07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input
paths to process : 9090
Then I realized there were 3 files that were bad (couldn't be
gunzipped).
So, I removed them by doing:
bin/hadoop dfs -rm srcdir/FILExxx.gz
20 hours later, the job had failed, and I found a few errors in
the log:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
Cannot open filename ...FILExxx.gz
Is it possible that the runtime could adjust the input data set
accordingly?
As Devaraj pointed out, this is possible, but in general I think it
is correct to make this an error. The planning for the job must
happen before the job is launched, and once a map has been assigned
a file, the mapper being unable to read that input is a fatal
problem. If failures are tolerable for your application, you can set
the percentage of map and reduce tasks that can fail before the job
is killed.
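A minimal sketch of that, assuming the org.apache.hadoop.mapred
JobConf API of that era (the class name and the 5% threshold are
only illustrative):

  import java.io.IOException;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class TolerantJob {                  // hypothetical driver
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(TolerantJob.class);
      conf.setJobName("tolerant-gzip-job");
      // Let up to 5% of map tasks fail without killing the job;
      // keep reduces strict.
      conf.setMaxMapTaskFailuresPercent(5);
      conf.setMaxReduceTaskFailuresPercent(0);
      // ... set input/output paths, mapper, reducer, etc. ...
      JobClient.runJob(conf);
    }
  }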
Can we check the existence of the output directory at the very
beginning, to save us a day?
It does already. That was done back before 0.1 in HADOOP-3. Was your
program launching two jobs or something? Very strange.
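If you want to fail fast in your own driver as well (for example,
before kicking off a long chain of jobs), here is a quick sketch
using the standard FileSystem API; the output path is only a
placeholder:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf();
  FileSystem fs = FileSystem.get(conf);
  Path out = new Path("outdir");            // placeholder output path
  if (fs.exists(out)) {
    throw new IOException("Output directory " + out + " already exists");
  }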
-- Owen