I know this works for 0.18.x. I'm not using 0.20 yet, but as long as the API
hasn't changed too much this should be pretty straightforward.
// "hdfs" is the FileSystem holding the output (e.g., FileSystem.get(conf)),
// and "conf" is the Configuration/JobConf of the job you are setting up.
Path prevOutputPath = new Path("...");
for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) {
    if (!fstatus.isDir()) {
        DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf);
    }
}
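In case it helps with the rest of the chain, here is a minimal sketch of how the follow-up job could read those files back out of the distributed cache in a mapper's or reducer's setup. The class and method names (CachedSideData, readSideData) are just illustrative, not anything from Hadoop itself:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper for the downstream job: reads every file that the
// previous job added to the distributed cache.
public class CachedSideData {

    public static void readSideData(Configuration conf) throws IOException {
        // Local paths where the framework copied the cached files on this node.
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        if (cached == null) {
            return; // nothing was placed in the cache
        }
        FileSystem localFs = FileSystem.getLocal(conf);
        for (Path p : cached) {
            FSDataInputStream in = localFs.open(p);
            try {
                // Parse the file here, e.g. load it into an in-memory lookup table.
            } finally {
                in.close();
            }
        }
    }
}

In the 0.20 mapreduce API you would call something like this from Mapper.setup() with context.getConfiguration(). And if the previous job ran with a single reducer (setNumReduceTasks(1)), there will be just one part file to parse.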
-----Original Message-----
From: Tiago Veloso [mailto:[email protected]]
Sent: Monday, April 26, 2010 12:11 PM
To: [email protected]
Cc: Tiago Veloso
Subject: Re: Chaining M/R Jobs
On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:
> I don't usually bother renaming the files. If you know you want all of
> the files, you just iterate over the files in the output directory from
> the previous job and then add those to the distributed cache. If the data
> is fairly small, you can set the number of reducers to 1 on the previous
> step as well.
And how do I iterate over a directory? Could you give me some sample code?
If relevant, I am using Hadoop 0.20.2.
Tiago Veloso
[email protected]