Thanks - NLineInputFormat is pretty close to what I want. In most cases the file is text and quite splittable, although that raises another issue: sometimes the file is compressed. Even though it may only be tens of megs, compression is useful to speed transport. In the case of a small file with enough work in the mapper, it may be useful to split even a gzipped file - even if that means reading from the beginning of the compressed stream to reach a specific index in the uncompressed data. Has anyone ever seen that done?
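Roughly what I picture is something like this - just a sketch, not working code (the class name and buffer size are mine, and plain java.util.zip stands in for whatever codec Hadoop would hand back; every mapper pays to decompress everything up to its own start offset):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class GzipSeek {
        // Decompress from the start of the gzipped stream and throw away
        // bytes until we are positioned at 'offset' in the *uncompressed*
        // data. A read loop is used instead of skip(), since skip() on a
        // GZIPInputStream may return before skipping the requested count.
        public static InputStream openAtUncompressedOffset(InputStream raw, long offset)
                throws IOException {
            InputStream in = new GZIPInputStream(raw);
            byte[] buf = new byte[64 * 1024];
            long remaining = offset;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    throw new IOException("offset " + offset + " is past end of stream");
                }
                remaining -= n;
            }
            return in; // now positioned at 'offset' in the decompressed data
        }
    }

For a file of only tens of megs, re-decompressing the prefix in each mapper seems a fair price if the per-record work dominates.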
On Mon, Sep 12, 2011 at 1:36 AM, Harsh J <ha...@cloudera.com> wrote:
> Hello Steve,
>
> On Mon, Sep 12, 2011 at 7:57 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> > I have a problem where there is a single, relatively small (10-20 MB) input
> > file. (It happens it is a fasta file which will have meaning if you are a
> > biologist.) I am already using a custom InputFormat and a custom reader
> > to force a custom parsing. The file may generate tens or hundreds of
> > millions of key value pairs and the mapper does a fair amount of work on
> > each record.
> > The standard implementation of
> >   public List<InputSplit> getSplits(JobContext job) throws IOException {
> > uses fs.getFileBlockLocations(file, 0, length); to determine the blocks and
> > for a file of this size will come up with a single InputSplit and a single
> > mapper.
> > I am looking for a good example of forcing the generation of multiple
> > InputSplits for a small file. In this case I am happy if every Mapper
> > instance is required to read and parse the entire file as long as I can
> > guarantee that every record is processed by only a single mapper.
>
> Is the file splittable?
>
> You may look at the FileInputFormat's "mapred.min.split.size" property. See
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMinInputSplitSize(org.apache.hadoop.mapreduce.Job, long)
>
> Perhaps the 'NLineInputFormat' may also be what you're really looking
> for, which lets you limit no. of records per mapper instead of
> fiddling around with byte sizes with the above.
>
> > While I think I see how I might modify getSplits(JobContext job) I am not
> > sure how and when the code is called when the job is running on the cluster.
>
> The method is called in the client-end, at the job-submission point.
>
> --
> Harsh J

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
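P.S. The sort of getSplits() override I was picturing, as a rough sketch against the new API - the class name and split count are inventions of mine, and it assumes the uncompressed, splittable case:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Sketch: carve a small file into NUM_SPLITS contiguous byte ranges,
    // ignoring block boundaries, so a 10-20 MB file still gets several
    // mappers. The LineRecordReader already copes with ranges that start
    // and end mid-line, so every line lands in exactly one mapper.
    public class WholeFileMultiSplitFormat extends TextInputFormat {
        private static final int NUM_SPLITS = 16; // tune to the cluster

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (InputSplit s : super.getSplits(job)) {
                FileSplit whole = (FileSplit) s;
                Path path = whole.getPath();
                long start = whole.getStart();
                long length = whole.getLength();
                for (int i = 0; i < NUM_SPLITS; i++) {
                    long from = start + i * length / NUM_SPLITS;
                    long to = start + (i + 1) * length / NUM_SPLITS;
                    if (to > from) {
                        splits.add(new FileSplit(path, from, to - from, new String[0]));
                    }
                }
            }
            return splits;
        }
    }

If I read computeSplitSize() right, Harsh's pointer gets there without a subclass too: the companion FileInputFormat.setMaxInputSplitSize(job, fileLength / N) should yield roughly N splits for the same file.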