I have a problem where the input is a single, relatively small (10-20 MB) file. (It happens to be a fasta file, which will have meaning if you are a biologist.) I am already using a custom InputFormat and a custom RecordReader to force my own parsing. The file may generate tens or hundreds of millions of key-value pairs, and the mapper does a fair amount of work on each record.

The standard implementation of public List<InputSplit> getSplits(JobContext job) uses fs.getFileBlockLocations(file, 0, length) to determine the blocks, and for a file of this size it comes up with a single InputSplit and therefore a single mapper.

I am looking for a good example of forcing the generation of multiple InputSplits for a small file. In this case I am happy for every Mapper instance to read and parse the entire file, as long as I can guarantee that every record is processed by exactly one mapper.
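To make the question concrete, the approach I think I see is roughly the sketch below (written against the new org.apache.hadoop.mapreduce API; ShardedFileSplit, ShardedFastaInputFormat, the Text key/value types, and the fasta.num.shards property are names I made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * A split that covers the whole file but carries a shard index, so each
 * mapper knows which subset of records it owns.
 */
class ShardedFileSplit extends InputSplit implements Writable {
    private Path file;
    private long length;
    private int shard;
    private int numShards;

    public ShardedFileSplit() { }   // no-arg constructor needed for deserialization

    public ShardedFileSplit(Path file, long length, int shard, int numShards) {
        this.file = file;
        this.length = length;
        this.shard = shard;
        this.numShards = numShards;
    }

    public Path getPath()     { return file; }
    public int getShard()     { return shard; }
    public int getNumShards() { return numShards; }

    @Override
    public long getLength() { return length; }

    @Override
    public String[] getLocations() { return new String[0]; }  // no locality hint; the file is tiny

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, file.toString());
        out.writeLong(length);
        out.writeInt(shard);
        out.writeInt(numShards);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        file = new Path(Text.readString(in));
        length = in.readLong();
        shard = in.readInt();
        numShards = in.readInt();
    }
}

/**
 * Instead of one split per HDFS block, emit N identical whole-file splits,
 * each tagged with its shard index. Left abstract: createRecordReader()
 * stays the custom reader described above.
 */
public abstract class ShardedFastaInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        int numShards = job.getConfiguration().getInt("fasta.num.shards", 8);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            for (int shard = 0; shard < numShards; shard++) {
                splits.add(new ShardedFileSplit(status.getPath(), status.getLen(),
                                                shard, numShards));
            }
        }
        return splits;
    }
}

The matching RecordReader would pull shard and numShards out of the ShardedFileSplit in initialize(), keep a running count of records as it parses the whole file, and emit only those records whose index modulo numShards equals its shard, which is what guarantees each record is seen by exactly one mapper.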
While I think I see how I might modify getSplits(JobContext job), I am not sure how and when that code gets called while the job is running on the cluster.

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com