Be careful putting them in HDFS. This does not scale well, since the number of file opens will be on the order of (number of mappers) * (number of reducers). You can quickly cause a denial of service on the namenode if you have a lot of mappers and reducers.
--Bobby Evans

On 5/21/12 4:02 AM, "Harsh J" <ha...@cloudera.com> wrote:

Biro,

I guess you could write these archives onto HDFS and have your reducers read them from a location there, but this method may be a bit ugly. See http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F for how to properly write files from tasks onto a DFS, or look at the MultipleOutputs API class.

Depending on how large these files are, you could also ship them in via the KV pairs themselves. A custom key or sort comparator can further ensure that they are delivered in the first iterations of the reducer, if the file is required before the regular reduce() ops can begin.

On Mon, May 21, 2012 at 1:42 PM, biro lehel <lehel.b...@yahoo.com> wrote:
> Dear all,
>
> In my Mapper, I run a script that processes my set of input text files and
> creates some other text files from them (this is done locally, on the FS of
> my nodes); as a result, each map task produces an archive. My issue is that
> I'm looking for a way for the Reducer to "take" these archives as some kind
> of input. I understand that communication between the Mapper and the Reducer
> is done through the key-value pairs in the Context, but what I need is to
> transfer these archive files to the respective Reducer (I would probably
> have one single Reducer, so all the files should be transferred/copied there
> somehow).
>
> Is this possible? Is there a way to transfer files from Mapper to Reducer?
> If not, what is the best approach in scenarios like mine? Any suggestions
> would be greatly appreciated.
>
> Thank you in advance,
> Lehel.

--
Harsh J
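
[Editor's note: a minimal sketch of the write-side-files-to-HDFS approach discussed above. The staging directory, local archive path, and class names are illustrative, not from the thread. Each map task copies its locally produced archive to HDFS under a task-attempt-unique name in cleanup(), and the single reducer pulls everything down in setup(), which only runs after the shuffle, i.e. after all map tasks have completed.]

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ArchiveShipping {
      // Hypothetical staging directory on HDFS; not from the thread.
      static final String STAGING = "/user/lehel/archive-staging";

      public static class ArchiveMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
          // The external script is assumed to have left its archive here locally.
          Path local = new Path("/tmp/output-archive.tar.gz");
          FileSystem fs = FileSystem.get(context.getConfiguration());
          // The task attempt id gives a collision-free file name per map task.
          // With speculative execution enabled, duplicate attempts may also
          // leave files, so either disable it or reconcile by task id later.
          Path dest = new Path(STAGING, context.getTaskAttemptID().toString() + ".tar.gz");
          fs.copyFromLocalFile(local, dest);
        }
      }

      public static class ArchiveReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
          FileSystem fs = FileSystem.get(context.getConfiguration());
          // One listStatus() plus one open per archive: cheap with a single
          // reducer, but this is exactly the pattern Bobby warns about if
          // many reducers each open every mapper's file.
          for (FileStatus stat : fs.listStatus(new Path(STAGING))) {
            fs.copyToLocalFile(stat.getPath(), new Path("/tmp", stat.getPath().getName()));
          }
        }
      }
    }

Note that writing under FileOutputFormat.getWorkOutputPath() (as the linked FAQ entry suggests) does not help here: side files written there are only committed when the job finishes, so a reducer in the same job would never see them. Hence the direct copy to a staging directory plus attempt-unique names.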
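
[Editor's note: and a sketch of the ship-it-through-the-shuffle idea Harsh mentions, assuming the job's map output types are (Text, BytesWritable) and the archives are small enough to fit in memory and the shuffle buffers. With a single reducer, a key prefix that sorts first can stand in for a full custom comparator.]

    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.io.FileUtils;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ShipViaShuffleMapper extends Mapper<LongWritable, Text, Text, BytesWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Regular per-record output would go here, also as (Text, BytesWritable).
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        // Read the archive the external script produced; the path is illustrative.
        byte[] bytes = FileUtils.readFileToByteArray(new File("/tmp/output-archive.tar.gz"));
        // '\0' sorts before any printable character under Text's raw byte
        // comparator, so with a single reducer these records are the first
        // ones reduce() sees -- no custom comparator needed in this case.
        context.write(new Text("\0archive:" + context.getTaskAttemptID()),
            new BytesWritable(bytes));
      }
    }

With more than one reducer, a custom Partitioner (and the sort comparator Harsh describes) would also be needed so that every reducer requiring the archive actually receives it.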