So, the issue is that the input path I specified was a directory, not a file.
As a result, Hadoop helpfully assumed that I wanted a file called
"data" in that directory to be the input, and proceeded down the path
with that assumption instead of failing fast. I had to go to the
source code to figure out why it was doing this.

I'm finding that Hadoop has this sort of behavior (assume a useless
default instead of failing fast) in a number of locations, some of
them highly problematic, such as the dreaded DrWho default user. It
was only after reading http://blog.rapleaf.com/dev/?p=382 that I
figured out why some of my services were losing data: the Hadoop libs
fall back to DrWho under strange conditions, then throw a permissions
exception when attempting to write a file, which subsequently kills a
buffer-flush thread of a long-lived process.

It would be very helpful if Hadoop were to fail fast when encountering
incorrect configuration rather than assuming a default that will
essentially never be used in a production environment. Both of these
issues have cost me far more time and money in lost business ($50k
just this week thanks to DrWho) than failing fast would have.

Thanks,

Kris

On Wed, Apr 7, 2010 at 6:23 AM, Sonal Goyal <[email protected]> wrote:
> Hi Kris,
>
> Seems your program cannot find the input file. Have you done a hadoop fs
> -ls to verify that the file exists? Also, the path URL should be
> hdfs://......
>
> Thanks and Regards,
> Sonal
> www.meghsoft.com
>
> On Wed, Apr 7, 2010 at 1:16 AM, Kris Nuttycombe <[email protected]>
> wrote:
>>
>> Exception in thread "main" java.io.FileNotFoundException: File does
>> not exist: hdfs:///test-batchEventLog/metrics/data
>>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>     at reporting.HDFSMapReduceQuery.execute(HDFSMetricsQuery.scala:60)
>>
>> My job config contains the following:
>>
>>     println("using input path: " + inPath)
>>     println("using output path: " + outPath)
>>     FileInputFormat.setInputPaths(job, inPath)
>>     FileOutputFormat.setOutputPath(job, outPath)
>>
>> with input & output paths printed out as:
>>
>> using input path: hdfs:/test-batchEventLog
>> using output path:
>> hdfs:/test-batchEventLog/out/03d24392-9bd9-4b23-8240-aceb54b3473c
>>
>> Any ideas why this would be occurring?
>>
>> Thanks,
>>
>> Kris
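To make the fail-fast request above concrete: a rough sketch of the kind of
client-side check that would have surfaced the mistake at submission time.
The helper name, the exception choices, and the assumption that this
particular job expects a single file (not a directory) as input are all
illustrative, not code from the job quoted above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // Fail fast before submitting the job: refuse to proceed unless the
    // input path exists and is a plain file. (Whether a directory is an
    // acceptable input depends on the job; here it is assumed not to be.)
    def requireInputFile(conf: Configuration, in: String): Path = {
      val path = new Path(in)
      val fs = path.getFileSystem(conf) // resolves hdfs:// vs. local fs from the URI
      if (!fs.exists(path))
        throw new java.io.FileNotFoundException("Input path does not exist: " + path)
      if (fs.getFileStatus(path).isDir) // isDirectory on newer releases
        throw new IllegalArgumentException("Input path is a directory, not a file: " + path)
      path
    }

    // e.g., just before wiring up the job, reusing the job and inPath from
    // the snippet above (inPath assumed here to be a String):
    FileInputFormat.setInputPaths(job, requireInputFile(job.getConfiguration, inPath))

The check adds one round trip to the namenode per submission, which seems a
small price compared to chasing a FileNotFoundException for a path the job
invented on its own.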
