Not sure if this helps in your use case, but you can put all the output files into the distributed cache and then access them in the subsequent map-reduce job. In the driver code:
    // previous mr-job's output; <output_path> is a placeholder for the real path
    String pstr = "hdfs://<output_path>/";
    FileSystem fs = FileSystem.get(job.getConfiguration());
    FileStatus[] files = fs.listStatus(new Path(pstr));
    for (FileStatus f : files) {
        if (!f.isDir()) {
            DistributedCache.addCacheFile(f.getPath().toUri(), job.getConfiguration());
        }
    }

I think you can also copy these files to a different location in DFS first and then put that copy into the distributed cache. Two rough sketches, one for the merge and one for the lookup side, are below the quoted message.

Deniz

On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote:

> Hello,
>
> I have a MapFile as a product of a MapReduce job, and what I need to do is:
>
> 1. If MapReduce produced more splits as output, merge them into a single file.
>
> 2. Copy this merged MapFile to another HDFS location and use it as a
> distributed cache file for another MapReduce job.
>
> I'm wondering if it is even possible to merge MapFiles, given their
> nature, and use them as a distributed cache file.
>
> What I'm trying to achieve is repeated fast search in this file during
> another MapReduce job.
> If my idea is completely wrong, can you give me a tip on how to do it?
>
> The file is supposed to be about 20 MB large.
> I'm using Hadoop 0.20.203.
>
> Thanks for your reply :)
>
> Ondrej Klimpera
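On the merge question: the simplest trick is to run the producing job with a single reducer, which leaves you with exactly one MapFile and nothing to merge. If you do end up with several parts, one way (only a sketch, fine for a ~20 MB file but not for large inputs) is to read every part into a sorted map and write a fresh MapFile, because MapFile.Writer requires keys to be appended in sorted order. I'm assuming Text keys and values here; adjust to your actual types:

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileMerge {
        // Merges the given MapFile directories into a single MapFile at
        // 'merged'. Loads everything into memory, so it is only suitable
        // for small files like the ~20 MB one in question.
        public static void merge(FileSystem fs, Path[] parts, Path merged,
                                 Configuration conf) throws IOException {
            TreeMap<Text, Text> entries = new TreeMap<Text, Text>();
            for (Path part : parts) {
                MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
                Text key = new Text();
                Text value = new Text();
                while (reader.next(key, value)) {
                    // copy, because the reader reuses the same objects
                    entries.put(new Text(key), new Text(value));
                }
                reader.close();
            }
            MapFile.Writer writer = new MapFile.Writer(conf, fs,
                    merged.toString(), Text.class, Text.class);
            for (Map.Entry<Text, Text> e : entries.entrySet()) {
                // iteration order of TreeMap satisfies the sorted-key requirement
                writer.append(e.getKey(), e.getValue());
            }
            writer.close();
        }
    }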
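And for the repeated lookups in the second job: since a MapFile is really a directory holding a data file and an index file, shipping it through the distributed cache file by file is a bit awkward. At ~20 MB it may be easier to open it from HDFS once per task in setup() and then call MapFile.Reader.get(), which binary-searches the index and seeks into the data file. Again just a sketch; the class name, the "lookup.mapfile.dir" property, and the Text types are made up:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that looks up each input key in the merged MapFile.
    public class LookupMapper extends Mapper<Text, Text, Text, Text> {

        private MapFile.Reader reader;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // "lookup.mapfile.dir" is an assumed property name; set it in the
            // driver, e.g. conf.set("lookup.mapfile.dir", "hdfs://.../merged")
            String dir = conf.get("lookup.mapfile.dir");
            reader = new MapFile.Reader(FileSystem.get(conf), dir, conf);
        }

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            Text found = new Text();
            // get() returns null when the key is not present in the MapFile
            if (reader.get(key, found) != null) {
                context.write(key, found);
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            reader.close();
        }
    }

Each get() is one index lookup plus a short seek, which should be plenty fast for a file of that size, and the reader stays open for the whole task rather than being reopened per record.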