> 20 hours later, the job had failed, and I found a few
> errors in the log:
>     org.apache.hadoop.ipc.RemoteException:
>     java.io.IOException: Cannot open filename ...FILExxx.gz
>
> Is it possible that the runtime could adjust the input data
> set accordingly?
If I remember right, the packaged InputFormat/RecordReader
implementations in Hadoop don't do this check. But if you implement
your own InputFormat/RecordReader, or subclass an existing
RecordReader, you can. Taking org.apache.hadoop.mapred.LineRecordReader
as the example for this discussion, you could swallow the failure in
LineRecordReader's constructor; the next(K,V) method would then have
to check whether the input stream is null and return false on the very
first call (apart from possibly some other things). So, in summary, it
is possible; there is a sketch of such a reader at the end of this
message, after the quoted original.

> Then, after many long hours, the job was about to finish with
>     ...INFO mapred.JobClient:  map 100% reduce 100%
> But, it ended up with
>     Exception in thread "main"
>     org.apache.hadoop.mapred.FileAlreadyExistsException: Output
>     directory ...resultdir already exists

This should be handled by OutputFormat.checkOutputSpecs. If you used
an OutputFormat implementation supplied with Hadoop that extends
org.apache.hadoop.mapred.OutputFormatBase, this check would fail the
job immediately when the output directory already exists. Which
OutputFormat are you using? (A driver-side workaround is also sketched
at the end of this message.)

> -----Original Message-----
> From: Nathan Wang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 25, 2007 11:00 PM
> To: [email protected]
> Subject: A couple of usability problems
>
> Hi,
> I have a couple of problems that I think the development team
> could enhance.
> I'm currently running a job that takes a whole day to finish.
>
> 1) Adjusting input set dynamically
> At the start, I had 9090 gzipped input data files for the job:
>     07/09/24 10:26:06 INFO mapred.FileInputFormat: Total
>     input paths to process : 9090
>
> Then I realized there were 3 files that were bad (couldn't be
> gunzipped). So, I removed them by doing:
>     bin/hadoop dfs -rm srcdir/FILExxx.gz
>
> 20 hours later, the job had failed, and I found a few
> errors in the log:
>     org.apache.hadoop.ipc.RemoteException:
>     java.io.IOException: Cannot open filename ...FILExxx.gz
>
> Is it possible that the runtime could adjust the input data
> set accordingly?
>
> 2) Checking the output directory first
> I started my job with the standard command line:
>     bin/hadoop jar myjob.jar srcdir resultdir
>
> Then, after many long hours, the job was about to finish with
>     ...INFO mapred.JobClient:  map 100% reduce 100%
> But, it ended up with
>     Exception in thread "main"
>     org.apache.hadoop.mapred.FileAlreadyExistsException: Output
>     directory ...resultdir already exists
>
> Can we check the existence of the output directory at the
> very beginning, to save us a day?
>
> Thanks,
> Nathan
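Here is a minimal sketch of what I mean. It is untested, written
against the old org.apache.hadoop.mapred API (adjust the signatures to
whatever Hadoop version you are on), and the TolerantTextInputFormat /
TolerantLineRecordReader names are made up for illustration. It wraps
LineRecordReader so that a split whose file can no longer be opened is
treated as empty instead of failing the task:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TolerantTextInputFormat extends TextInputFormat {

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    return new TolerantLineRecordReader(job, (FileSplit) split);
  }

  public static class TolerantLineRecordReader
      implements RecordReader<LongWritable, Text> {

    // Null when the underlying file could not be opened.
    private LineRecordReader inner;

    public TolerantLineRecordReader(JobConf job, FileSplit split) {
      try {
        inner = new LineRecordReader(job, split);
      } catch (IOException e) {
        // The file vanished after job submission; treat the split as
        // empty rather than failing the task. Worth logging, at least.
        inner = null;
      }
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      // Returns false on the very first call if the file never opened.
      return inner != null && inner.next(key, value);
    }

    public LongWritable createKey() { return new LongWritable(); }

    public Text createValue() { return new Text(); }

    public long getPos() throws IOException {
      return inner == null ? 0 : inner.getPos();
    }

    public float getProgress() throws IOException {
      return inner == null ? 1.0f : inner.getProgress();
    }

    public void close() throws IOException {
      if (inner != null) inner.close();
    }
  }
}

You would enable it with job.setInputFormat(TolerantTextInputFormat.class).
Be aware that this silently skips any file that disappears, not just
the ones you deleted on purpose, so make sure the catch block logs.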
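And for the output-directory problem, independent of which
OutputFormat does the check, you can fail fast from your own driver
before the job is submitted. A sketch under the same assumptions (old
mapred API, where JobConf still has getOutputPath(); the
failIfOutputExists and runChecked names are made up):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OutputDirCheck {

  // Throws before submission if the configured output directory exists.
  public static void failIfOutputExists(JobConf job) throws IOException {
    Path out = job.getOutputPath();
    FileSystem fs = FileSystem.get(job);
    if (fs.exists(out)) {
      throw new IOException("Output directory " + out + " already exists");
    }
  }

  // Typical driver usage: check first, then run the job as usual.
  public static void runChecked(JobConf job) throws IOException {
    failIfOutputExists(job);
    JobClient.runJob(job);
  }
}

That way a day-long job cannot get to "map 100% reduce 100%" and only
then trip over FileAlreadyExistsException.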
