OK, that is encouraging. I'll take another pass at it. I succeeded yesterday
with an in-memory-only InputFormat, but only after I commented out some of
the split-referencing code in MapTask.java, like the following:
  if (instantiatedSplit instanceof FileSplit) {
    FileSplit fileSplit = (FileSplit) instantiatedSplit;
    job.set("map.input.file", fileSplit.getPath().toString());
    job.setLong("map.input.start", fileSplit.getStart());
    job.setLong("map.input.length", fileSplit.getLength());
  }

But maybe I simply need to override more methods in more of the embedded
classes. You can see why I was wondering about the file system dependencies.

Doug Cutting <[EMAIL PROTECTED]> wrote on 10/24/2007 09:02 AM
To: hadoop-user@lucene.apache.org
Subject: Re: InputFiles, Splits, Maps, Tasks

Lance Amundsen wrote:
> I am starting to wonder if it might indeed be impossible to get map jobs
> running w/o writing to the file system.... as in, not w/o some major
> changes to the job and task tracker code.
>
> I was thinking about creating an InputFormat that does no file I/O and is
> instead queue based. As mappers start up, their getRecordReader calls get
> re-directed to a remote queue to pull one or more records off of. But I am
> starting to wonder if the file system dependencies in the code are such
> that I could never completely avoid using files. Specifically, even if I
> completely re-write an InputFormat, the framework is still going to try to
> do Filesystem stuff on everything I return (the extensive internal use of
> splits is baffling me some).

Nothing internally should depend on an InputSplit representing a file.

You do need to be able to generate the complete set of splits when the job
is launched. So if you wanted maps to poll a queue for each task, then you'd
need to know how long the queue is when the job is launched, so that you
could generate the right number of polling splits.

Doug
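
For what it's worth, here is a minimal, untested sketch of the kind of
file-free format Doug describes, written against the old
org.apache.hadoop.mapred interfaces (the generic form; the non-generic
interfaces of that era have the same shape). The split carries a queue
endpoint and a record count fixed at launch time instead of a path. The
queue.endpoint and queue.length config keys and the pollQueue() helper are
hypothetical stand-ins for whatever remote queue client would really be
used, not Hadoop APIs.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.InputFormat;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class QueueInputFormat implements InputFormat<LongWritable, Text> {

    // A split that references a queue rather than a file: no path, no offsets.
    public static class QueueSplit implements InputSplit {
      String endpoint;   // where this map should poll (hypothetical)
      long numRecords;   // how many records this map should pull

      public QueueSplit() {}  // no-arg constructor needed for deserialization
      public QueueSplit(String endpoint, long numRecords) {
        this.endpoint = endpoint;
        this.numRecords = numRecords;
      }

      public long getLength() { return numRecords; }  // scheduling hint only
      public String[] getLocations() { return new String[0]; }  // no locality

      public void write(DataOutput out) throws IOException {
        Text.writeString(out, endpoint);
        out.writeLong(numRecords);
      }
      public void readFields(DataInput in) throws IOException {
        endpoint = Text.readString(in);
        numRecords = in.readLong();
      }
    }

    // Doug's constraint lives here: the queue length must be knowable at
    // job-launch time so the complete set of splits is generated up front.
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      String endpoint = job.get("queue.endpoint");      // hypothetical key
      long queueLen = job.getLong("queue.length", 0L);  // known at launch
      long per = (queueLen + numSplits - 1) / numSplits;  // ceiling division
      InputSplit[] splits = new InputSplit[numSplits];
      for (int i = 0; i < numSplits; i++) {
        long remaining = Math.max(0L, queueLen - i * per);
        splits[i] = new QueueSplit(endpoint, Math.min(per, remaining));
      }
      return splits;
    }

    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      final QueueSplit qs = (QueueSplit) split;
      return new RecordReader<LongWritable, Text>() {
        private long consumed = 0;

        public boolean next(LongWritable key, Text value) throws IOException {
          if (consumed >= qs.numRecords) return false;
          key.set(consumed++);
          value.set(pollQueue(qs.endpoint));  // remote poll, no file I/O
          return true;
        }
        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return consumed; }
        public float getProgress() {
          return qs.numRecords == 0 ? 1.0f : consumed / (float) qs.numRecords;
        }
        public void close() {}
      };
    }

    // Stand-in for whatever remote queue client would really be used.
    private static String pollQueue(String endpoint) { return "record"; }
  }

Note that the map.input.file code quoted above is already guarded by an
instanceof FileSplit check, so in principle a split like this should pass
through it untouched, without any commenting-out.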