Hi,

On Wed, Nov 17, 2010 at 7:52 PM, Bhaskar Ghosh <bjgin...@yahoo.co.in> wrote:
>
> ---I am reading files within a directory and also subdirectories.

Currently FileInputFormat lets you read files for MapReduce, but it does
not recurse into directories. Although globs are accepted in Path strings,
for proper recursion you need to implement the logic yourself inside a
custom class that extends FileInputFormat.

> ---Processing one file at a time

Doable by turning off file-splitting, or by creating custom
SequenceFiles/HARs.

> ---Writing all the processed output to a single output file. [One output
> file per folder]

Doable with a single reducer, but why do you require a single file?

> I think I need to give one file to one Mapper at a time, when all the
> mappers combine, one single reducer should write to a single file. [as I
> think we cannot write parallely to a single output file]

There's a "getmerge" feature in the Hadoop DFS shell utilities that
retrieves a DFS directory of outputs as a single local file. You should use
that feature instead of bottlenecking your reduce phase with a single
reducer instance (unless it's a requirement of some sort). See:
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html for the exact
command syntax.

> Please suggest me (or point me to resources) so that I can:
> a) My map function gets one file at a time (instead of one line at a time)

I suggest pre-creating a Hadoop SequenceFile for this purpose, with the
<Key, Value> being <Filename, Contents>. Another solution would be to use
HAR. See http://www.cloudera.com/blog/2009/02/the-small-files-problem/ for
some further discussion on this.

> b) Should implementing a custom RecordReader and/or FileInputFormat allow me
> to read files in subdirectories as well (one file at a time) ?

FileInputFormat.isSplitable is the method that tells the framework whether
input files may be split into chunks for processing, and
FileInputFormat.listStatus is the method that lists all files (FileStatus
objects) in a directory to compute Mapper splits for.
You should write a custom class that extends FileInputFormat and overrides
both methods: return false from the split check so that files are not
split, and do the recursion yourself in listStatus so that it hands a
proper list of FileStatus objects back to the framework. (In trunk code,
the recursion support has been added to FileInputFormat itself. See
MAPREDUCE-1501 on Apache's JIRA for the specifics and a patch.)

-- 
Harsh J
www.harshj.com
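
P.S. To illustrate the "recurse yourself" part: the following is a minimal
sketch, not Hadoop code. It uses java.io.File on the local filesystem as a
stand-in to show the depth-first walk that a listStatus override would need
to perform. Inside a real FileInputFormat subclass the same shape would
presumably be built on FileSystem.listStatus(Path) and FileStatus objects
instead. The names RecursiveFileLister and listFilesRecursively are made up
for this illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Local-filesystem stand-in (an assumption for illustration) for the
 * recursion a custom listStatus override would perform on a DFS.
 */
final class RecursiveFileLister {
    private RecursiveFileLister() {}

    /** Returns every regular file under root, descending into subdirectories. */
    static List<File> listFilesRecursively(File root) {
        List<File> files = new ArrayList<>();
        collect(root, files);
        return files;
    }

    private static void collect(File entry, List<File> out) {
        if (entry.isFile()) {
            // Leaf: a plain file goes straight into the flat result list.
            out.add(entry);
            return;
        }
        // listFiles() returns null if entry is not a directory or cannot be read.
        File[] children = entry.listFiles();
        if (children == null) {
            return;
        }
        // Sort so the traversal order is deterministic across platforms.
        Arrays.sort(children);
        for (File child : children) {
            collect(child, out);
        }
    }
}
```

Calling RecursiveFileLister.listFilesRecursively(new File("input")) on a
directory that holds a.txt and sub/b.txt returns both files in one flat
list, with no directory entries left in it. That flat, files-only list is
the form the framework needs before it computes splits.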