Thanks - NLineInputFormat is pretty close to what I want. In most cases the file is text and quite splittable, although that raises another issue: sometimes the file is compressed. Even though it may only be tens of megs, compression is useful to speed transport. In the case of a small file with enough work in the mapper, it may be useful to split even a gzipped file - even if that means reading from the beginning of the compressed stream to reach a specific index in the uncompressed data. Has anyone ever seen that done?
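Roughly what I picture is something like this - just a sketch, not working code (the class name and buffer size are mine, and plain java.util.zip stands in for whatever codec Hadoop would hand back; every mapper pays to decompress everything up to its own start offset):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class GzipSeek {
        // Decompress from the start of the gzipped stream and throw away
        // bytes until we are positioned at 'offset' in the *uncompressed*
        // data. A read loop is used instead of skip(), since skip() on a
        // GZIPInputStream may return before skipping the requested count.
        public static InputStream openAtUncompressedOffset(InputStream raw, long offset)
                throws IOException {
            InputStream in = new GZIPInputStream(raw);
            byte[] buf = new byte[64 * 1024];
            long remaining = offset;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    throw new IOException("offset " + offset + " is past end of stream");
                }
                remaining -= n;
            }
            return in; // now positioned at 'offset' in the decompressed data
        }
    }

For a file of only tens of megs, re-decompressing the prefix in each mapper seems a fair price if the per-record work dominates.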
On Mon, Sep 12, 2011 at 1:36 AM, Harsh J <ha...@cloudera.com> wrote:
> Hello Steve,
>
> On Mon, Sep 12, 2011 at 7:57 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> > I have a problem where there is a single, relatively small (10-20 MB) input
> > file. (It happens it is a fasta file which will have meaning if you are a
> > biologist.) I am already using a custom InputFormat and a custom reader
> > to force a custom parsing. The file may generate tens or hundreds of
> > millions of key value pairs and the mapper does a fair amount of work on
> > each record.
> > The standard implementation of
> >   public List<InputSplit> getSplits(JobContext job) throws IOException {
> > uses fs.getFileBlockLocations(file, 0, length); to determine the blocks and
> > for a file of this size will come up with a single InputSplit and a single
> > mapper.
> > I am looking for a good example of forcing the generation of multiple
> > InputSplits for a small file. In this case I am happy if every Mapper
> > instance is required to read and parse the entire file as long as I can
> > guarantee that every record is processed by only a single mapper.
>
> Is the file splittable?
>
> You may look at the FileInputFormat's "mapred.min.split.size" property. See
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMinInputSplitSize(org.apache.hadoop.mapreduce.Job, long)
>
> Perhaps the 'NLineInputFormat' may also be what you're really looking
> for, which lets you limit no. of records per mapper instead of
> fiddling around with byte sizes with the above.
>
> > While I think I see how I might modify getSplits(JobContext job) I am not
> > sure how and when the code is called when the job is running on the cluster.
>
> The method is called in the client-end, at the job-submission point.
>
> --
> Harsh J

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
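P.S. The sort of getSplits() override I was picturing, as a rough sketch against the new API - the class name and split count are inventions of mine, and it assumes the uncompressed, splittable case:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Sketch: carve a small file into NUM_SPLITS contiguous byte ranges,
    // ignoring block boundaries, so a 10-20 MB file still gets several
    // mappers. The LineRecordReader already copes with ranges that start
    // and end mid-line, so every line lands in exactly one mapper.
    public class WholeFileMultiSplitFormat extends TextInputFormat {
        private static final int NUM_SPLITS = 16; // tune to the cluster

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (InputSplit s : super.getSplits(job)) {
                FileSplit whole = (FileSplit) s;
                Path path = whole.getPath();
                long start = whole.getStart();
                long length = whole.getLength();
                for (int i = 0; i < NUM_SPLITS; i++) {
                    long from = start + i * length / NUM_SPLITS;
                    long to = start + (i + 1) * length / NUM_SPLITS;
                    if (to > from) {
                        splits.add(new FileSplit(path, from, to - from, new String[0]));
                    }
                }
            }
            return splits;
        }
    }

If I read computeSplitSize() right, Harsh's pointer gets there without a subclass too: the companion FileInputFormat.setMaxInputSplitSize(job, fileLength / N) should yield roughly N splits for the same file.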