On Sep 25, 2007, at 10:30 AM, Nathan Wang wrote:

1) Adjusting input set dynamically
At the start, I had 9090 gzipped input data files for the job,
07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 9090

Then I realized there were 3 files that were bad (couldn't be gunzipped).
So, I removed them by doing,
    bin/hadoop  dfs  -rm  srcdir/FILExxx.gz

20 hours later, the job failed, and I found a few errors in the log:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz

Is it possible that the runtime could adjust the input data set accordingly?

As Devaraj pointed out, this is possible, but in general I think it is correct to treat this as an error. The planning for the job must happen before the job is launched, and once a map has been assigned a file, a mapper that can't read its assigned input is a fatal problem. If failures are tolerable for your application, you can set the percentage of mappers and reducers that are allowed to fail before the job is killed.
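For reference, a minimal sketch of what that looks like with the old mapred JobConf API; the exact setters and property names may differ between releases, and the paths here are placeholders, so treat this as illustrative rather than canonical:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TolerantJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TolerantJob.class);
        conf.setJobName("tolerant-job");

        // Placeholder input/output paths; defaults to the identity
        // mapper/reducer over text input.
        FileInputFormat.setInputPaths(conf, new Path("srcdir"));
        FileOutputFormat.setOutputPath(conf, new Path("outdir"));

        // Allow up to 5% of map tasks and 5% of reduce tasks to fail
        // without killing the whole job.
        conf.setMaxMapTaskFailuresPercent(5);
        conf.setMaxReduceTaskFailuresPercent(5);

        JobClient.runJob(conf);
      }
    }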

Can we check the existence of the output directory at the very beginning, to save us a day?

It does already. That was done back before 0.1 in HADOOP-3. Was your program launching two jobs or something? Very strange.
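For what it's worth, you can also do the same check explicitly in your driver before submitting anything, which helps when one program launches several jobs. A minimal sketch, using a hypothetical checkOutputDir helper:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputCheck {
      // Hypothetical pre-flight check: fail fast if the output
      // directory already exists, before any job is submitted.
      public static void checkOutputDir(Configuration conf, String dir)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(dir);
        if (fs.exists(out)) {
          throw new IllegalStateException(
              "Output directory already exists: " + out);
        }
      }
    }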

-- Owen
