On Tue, 7 Jun 2011 09:41:21 -0300, "Juan P." <[email protected]>
wrote:
> Not 100% clear on what you meant. You are saying I should put the file
> into my HDFS cluster or should I use DistributedCache? If you suggest
> the latter, could you address my original question?
I mean that you can certainly get away with putting the file at a known
place on HDFS and loading it in each mapper or reducer, but that may become
very inefficient as your problem scales up, since every task re-reads the
file over the network, whereas the DistributedCache copies it to each task
node's local disk once. Mostly I was responding to Shi Yu's question about
why the DC is even worth using at all.
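For concreteness, the HDFS-only approach would look something like the
sketch below (the path and class name are made up, not from my code):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSideLoad {
        // Reads the side file straight from HDFS. Fine for small jobs,
        // but every task pays this network cost.
        public static void load(Configuration conf) throws Exception {
            Path sidePath = new Path("/shared/lookup.txt"); // made-up location
            FileSystem fs = FileSystem.get(conf);
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(sidePath)));
            String line;
            while ((line = reader.readLine()) != null) {
                // parse each line of the side data here
            }
            reader.close();
        }
    }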
As to your question, here's how I do it, which I think I basically lifted
from an example in The Definitive Guide. There may be better ways, though.
In my setup, I put files into the DC by getting Path objects (which should
be able to reference either HDFS or local filesystem files, though I always
have my files on HDFS to start) and using
    DistributedCache.addCacheFile(path.toUri(), conf);
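In case it helps, here's roughly how that call fits into a driver (just a
sketch; the HDFS location and class name are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheSetup {
        public static void addLookupFile(Configuration conf) {
            // Made-up HDFS location; point this at your own file.
            Path path = new Path("/data/lookup.txt");
            // The framework copies this file to each task node's local
            // disk before any tasks run.
            DistributedCache.addCacheFile(path.toUri(), conf);
        }
    }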
Then within my mapper or reducer I retrieve all the cached files with
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
IIRC, this is what you were doing. The catch is that it returns all of the
cached files, which by this point live in a framework-chosen working
directory on the task's local filesystem. Luckily, I know the filename of
the file I want, so I iterate over them:
    for (Path cachePath : cacheFiles) {
        if (cachePath.getName().equals(cachedFilename)) {
            return cachePath;
        }
    }
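Putting those pieces together, a mapper's setup() could look something
like this (the new-API Mapper types and the filename are just for
illustration; adapt to your job):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Path lookupPath;

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
            if (cacheFiles == null) {
                throw new IOException("no files in the DistributedCache");
            }
            for (Path cachePath : cacheFiles) {
                // The local directory is framework-chosen, so match by name.
                if (cachePath.getName().equals("lookup.txt")) {
                    lookupPath = cachePath;
                }
            }
            // lookupPath now points at the local copy; open it with plain
            // java.io, since it lives on the task node's local filesystem.
        }
    }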
Then I've got the path to the local filesystem copy of the file I want in
hand and I can do whatever I want with it.
hth