On Tue, 7 Jun 2011 09:41:21 -0300, "Juan P." <[email protected]>
wrote:
> Not 100% clear on what you meant. You are saying I should put the file
> into my HDFS cluster or should I use DistributedCache? If you suggest
> the latter, could you address my original question?
I mean that you can certainly get away with putting the file at a known
place on HDFS and loading it in each mapper or reducer, but that may become
very inefficient as your problem scales up, since every task re-reads the
file over the network, whereas the DistributedCache copies it to each task
node's local disk once. Mostly I was responding to Shi Yu's question about
why the DC is even worth using at all.
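For concreteness, the HDFS-only approach would look something like the
sketch below (the path and class name are made up, not from my code):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSideLoad {
        // Reads the side file straight from HDFS. Fine for small jobs,
        // but every task pays this network cost.
        public static void load(Configuration conf) throws Exception {
            Path sidePath = new Path("/shared/lookup.txt"); // made-up location
            FileSystem fs = FileSystem.get(conf);
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(sidePath)));
            String line;
            while ((line = reader.readLine()) != null) {
                // parse each line of the side data here
            }
            reader.close();
        }
    }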
As to your question, here's how I do it, which I think I basically lifted
from an example in The Definitive Guide. There may be better ways, though.
In my setup, I put files into the DC by getting Path objects (which should
be able to reference either HDFS or local filesystem files, though I always
have my files on HDFS to start) and using
    DistributedCache.addCacheFile(path.toUri(), conf);
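In case it helps, here's roughly how that call fits into a driver (just a
sketch; the HDFS location and class name are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheSetup {
        public static void addLookupFile(Configuration conf) {
            // Made-up HDFS location; point this at your own file.
            Path path = new Path("/data/lookup.txt");
            // The framework copies this file to each task node's local
            // disk before any tasks run.
            DistributedCache.addCacheFile(path.toUri(), conf);
        }
    }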
Then within my mapper or reducer I retrieve all the cached files with
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
IIRC, this is what you were doing. The catch is that it returns all of the
cached files, which by this point live in a framework-chosen working
directory on the task's local filesystem. Luckily, I know the filename of
the file I want, so I iterate over them:
    for (Path cachePath : cacheFiles) {
        if (cachePath.getName().equals(cachedFilename)) {
            return cachePath;
        }
    }
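Putting those pieces together, a mapper's setup() could look something
like this (the new-API Mapper types and the filename are just for
illustration; adapt to your job):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Path lookupPath;

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
            if (cacheFiles == null) {
                throw new IOException("no files in the DistributedCache");
            }
            for (Path cachePath : cacheFiles) {
                // The local directory is framework-chosen, so match by name.
                if (cachePath.getName().equals("lookup.txt")) {
                    lookupPath = cachePath;
                }
            }
            // lookupPath now points at the local copy; open it with plain
            // java.io, since it lives on the task node's local filesystem.
        }
    }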
Then I've got the path to the local filesystem copy of the file I want in
hand and I can do whatever I want with it.
hth