Dan,

Splitting and reading a whole file as a single record are two slightly different things. The former controls whether your files may be split across mappers (useful when a file spans multiple blocks in HDFS). The latter has to be achieved differently.
The TextInputFormat provides a LineRecordReader by default, which, as its name suggests, reads whatever stream is handed to it line by line. This happens regardless of the file's block splits (a very different thing from line splits), so overriding isSplitable() alone only keeps each file in one split. To get what you want, you need to implement your own RecordReader that reads the whole stream into a single object and passes it to the Mapper, and return it from your InputFormat. A rough sketch follows the quoted message below.

On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew <wirefr...@googlemail.com> wrote:
> I require each input file to be processed by each mapper as a whole.
>
> I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
> isSplitable() to invariably return false.
>
> The job is configured to use this subclass as the input format class via
> setInputFormatClass(). The job runs without error, yet the logs reveal
> files are still processed line by line by the mappers.
>
> Any help would be greatly appreciated,
> Thanks

--
Harsh J
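P.S. Something along these lines should work (an untested sketch against the new mapreduce API; the names WholeFileInputFormat and WholeFileRecordReader are purely illustrative, and I've extended FileInputFormat rather than TextInputFormat so the value can be a BytesWritable holding the raw file contents):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // keep each file in a single split
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;
    }
    // Read the entire file into a single value.
    byte[] contents = new byte[(int) fileSplit.getLength()];
    Path file = fileSplit.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() { /* nothing to close */ }
}

In the driver you'd then keep your job.setInputFormatClass(WholeFileInputFormat.class) call and write the Mapper against <NullWritable, BytesWritable> as its input key/value types. Note this buffers each file in memory, so it only suits reasonably small files.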