Not sure if this helps in your use case, but you can put all of the output files into the distributed cache and then access them in the subsequent MapReduce job. In the driver code:

        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // previous MR job's output directory (<output_path> is a placeholder)
        String pstr = "hdfs://<output_path>/";
        FileSystem fs = FileSystem.get(job.getConfiguration());
        FileStatus[] files = fs.listStatus(new Path(pstr));
        for (FileStatus f : files) {
                // cache plain files only; skip any subdirectories
                if (!f.isDir()) {
                        DistributedCache.addCacheFile(f.getPath().toUri(),
                                        job.getConfiguration());
                }
        }
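On the task side, the cached files then show up as local paths on each node. A minimal sketch of reading them in the next job's mapper (new API, as implied by the Job object above; the class name LookupMapper and the record types are just placeholders for your own):

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
                @Override
                protected void setup(Context context) throws IOException {
                        // local (task-node) copies of the cached files
                        Path[] cached = DistributedCache.getLocalCacheFiles(
                                        context.getConfiguration());
                        if (cached == null) return;
                        for (Path p : cached) {
                                BufferedReader r = new BufferedReader(
                                                new FileReader(p.toString()));
                                try {
                                        // ... load lookup data into memory ...
                                } finally {
                                        r.close();
                                }
                        }
                }
        }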

I think you can also copy these files to a different location in DFS first and then put them into the distributed cache, along these lines:
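A rough sketch of that copy step, again in the driver; <cache_path> is a hypothetical staging directory of your choosing, and deleteSource=false leaves the originals in place:

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        FileSystem fs = FileSystem.get(job.getConfiguration());
        Path src = new Path("hdfs://<output_path>/");
        Path dst = new Path("hdfs://<cache_path>/");  // placeholder staging dir
        // false = do not delete the source after copying
        FileUtil.copy(fs, src, fs, dst, false, job.getConfiguration());
        // then addCacheFile() on the files under dst, as above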


Deniz 


On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote:

> Hello,
> 
> I have a MapFile as the product of a MapReduce job, and what I need to do is:
> 
> 1. If the MapReduce job produced more than one split as output, merge them into a single file.
> 
> 2. Copy this merged MapFile to another HDFS location and use it as a 
> Distributed cache file for another MapReduce job.
> 
> I'm wondering if it is even possible to merge MapFiles, given their nature, and use the result as a Distributed Cache file.
> 
> What I'm trying to achieve is repeated fast lookups in this file during another MapReduce job.
> If my idea is completely wrong, can you give me a tip on how to do it?
> 
> The file is expected to be about 20 MB in size.
> I'm using Hadoop 0.20.203.
> 
> Thanks for your reply:)
> 
> Ondrej Klimpera
