Re: How to read whole files and output processed texts to another file through MapReduce

Harsh J Wed, 17 Nov 2010 09:41:33 -0800

Hi,

On Wed, Nov 17, 2010 at 7:52 PM, Bhaskar Ghosh <bjgin...@yahoo.co.in> wrote:
>
> ---I am reading files within a directory and also subdirectories.


Currently, FileInputFormat reads files for MapReduce but does not
recurse into subdirectories. Globs are accepted in input Path strings,
but for proper recursion you need to implement the logic yourself in a
custom class that extends FileInputFormat.

> ---Processing one file at a time

Doable by turning off file-splitting, or by creating custom SequenceFiles/HARs.

> ---Writing all the processed output to a single output file. [One output
> file per folder]

Doable with a single reducer, but why do you require a single file?

> I think I need to give one file to one Mapper at a time, when all the
> mappers combine, one single reducer should write to a single file. [as I
> think we cannot write parallely to a single output file]

The Hadoop DFS shell provides a "getmerge" command that retrieves a
DFS directory of outputs as a single local file. You should use that
instead of bottlenecking your reduce phase with a single reducer
instance (unless it's a requirement of some sort).

See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html for
the exact command syntax.
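
To make it concrete what a getmerge-style step amounts to, here is a
minimal plain-Java sketch that concatenates the part files of an output
directory into one file on the local filesystem. It is not Hadoop's
implementation and does not talk to DFS. The class name PartFileMerger,
the method name merge, and the choice to pick up only files matching
"part-*" are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: local, plain-Java illustration of a getmerge-style merge. */
final class PartFileMerger {

    /**
     * Concatenates every regular file named part-* in outputDir, in name
     * order, into target. Returns the number of bytes written.
     * The "part-*" filter is an assumption of this sketch.
     */
    static long merge(Path outputDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        // Directory listing order is unspecified, so sort by path name
        // (part-00000, part-00001, ...) to get a stable result.
        Collections.sort(parts);

        long bytes = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                bytes += Files.copy(part, out); // appends this part to the open stream
            }
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PartFileMerger <outputDir> <mergedFile>");
            return;
        }
        long n = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(n + " bytes merged");
    }
}
```

The point of the sketch is that every reducer can keep writing its own
part file in parallel, and the merge into one file happens afterwards as
a cheap sequential copy.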

> Please suggest me (or point me to resources) so that I can:
> a) My map function gets one file at a time (instead of one line at a time)

I suggest pre-creating a Hadoop SequenceFile for this purpose, with
the <Key, Value> being <Filename, Contents>. Another solution would be
to use HAR. See
http://www.cloudera.com/blog/2009/02/the-small-files-problem/ for some
further discussion on this.

> b) Should implementing a custom RecordReader and/or FileInputFormat allow me
> to read files in subdirectories as well (one file at a time) ?

FileInputFormat.isSplitable is a method that tells the framework
whether an input file may be split into chunks for processing, and
FileInputFormat.listStatus is a method that lists all files
(FileStatus objects) in a directory to compute Mapper splits for.

You should write a custom class that extends FileInputFormat and
overrides these two methods: return false from isSplitable so that
files are not split, and do the recursion yourself in listStatus so
that it hands a complete list of FileStatus objects back to the
framework.
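
The recursion itself is a plain depth-first walk. Below is a minimal
sketch of that traversal written against java.nio.file rather than
Hadoop's FileSystem, so that it runs without a cluster. The class
RecursiveLister and its method names are hypothetical. In an actual
listStatus override, the analogous calls would be
FileSystem.listStatus(Path) and FileStatus.isDir(), and the result would
be FileStatus objects rather than Paths. Skipping names that start with
"_" or "." mirrors the hidden-file filter FileInputFormat applies.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plain-Java analog of a recursive listStatus. */
final class RecursiveLister {

    /** Returns every regular file under root, descending into subdirectories. */
    static List<Path> listFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        collect(root, files);
        return files;
    }

    private static void collect(Path path, List<Path> files) throws IOException {
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    String name = child.getFileName().toString();
                    // Skip hidden entries such as _logs, _SUCCESS or .crc files.
                    if (name.startsWith("_") || name.startsWith(".")) {
                        continue;
                    }
                    collect(child, files); // recurse into subdirectories
                }
            }
        } else if (Files.isRegularFile(path)) {
            files.add(path); // a leaf: one input file, one unsplit map input
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: RecursiveLister <rootDir>");
            return;
        }
        for (Path file : listFiles(Paths.get(args[0]))) {
            System.out.println(file);
        }
    }
}
```

With splitting turned off, each path in the returned list corresponds to
exactly one map input, which is what "one file at a time" needs.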

(In trunk code, the recursion support has been added to
FileInputFormat itself. See MAPREDUCE-1501 on Apache's JIRA for the
specifics and a patch.)

-- 
Harsh J
www.harshj.com
