I've seen this asked before, but haven't seen a response yet.

If the input to a streaming job is not the actual data splits but simply a list
of HDFS file names, which the mappers then open and read themselves, how can
data locality be achieved?
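
To make the setup concrete, here is a minimal sketch of the kind of mapper I
mean. It assumes each input line carries one HDFS path and that the `hadoop`
client is on the task's PATH. The helper names (parse_paths, cat_command,
process) are made up for illustration, and the line count just stands in for
whatever the real processing is. Since the job's input is only the list of
names, the scheduler has nothing to go on when placing the task near the blocks
of the file it will actually read.

```python
#!/usr/bin/env python
# Sketch of a streaming mapper whose input records are HDFS file names,
# not file contents. All helper names here are hypothetical.
import subprocess
import sys


def parse_paths(lines):
    """Yield one HDFS path per non-empty input line.

    Depending on the input format, streaming may pass "key<TAB>value"
    records, so only the last tab-separated field is kept (assumption:
    the paths themselves contain no tabs).
    """
    for line in lines:
        line = line.strip()
        if line:
            yield line.split("\t")[-1]


def cat_command(path):
    """Build the client command that streams one HDFS file to stdout."""
    return ["hadoop", "fs", "-cat", path]


def process(path, out):
    """Read the named file through the HDFS client and emit path<TAB>line count.

    The read goes to whichever datanodes hold the blocks, which is
    usually not the node this task was scheduled on.
    """
    proc = subprocess.Popen(cat_command(path), stdout=subprocess.PIPE)
    count = 0
    for _ in proc.stdout:
        count += 1
    proc.wait()
    out.write("%s\t%d\n" % (path, count))


def main(stdin, stdout):
    for path in parse_paths(stdin):
        process(path, stdout)

# As an actual mapper script, finish with: main(sys.stdin, sys.stdout)
```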

Likewise, is there an easier way to make those files accessible than the
-cacheFile flag?  That requires building a very long hadoop command
(potentially hundreds of files), and I'm worried about exceeding some
command-line length limit.  It would also be nice to do this programmatically,
say with the DistributedCache.addCacheFile() method, but that requires writing
your own driver, which I don't see how to do with streaming.
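
The closest thing I can come up with is generating the command from a script,
roughly as sketched below. This assumes the usual -cacheFile form of
`hdfs-uri#linkname`; the function names, the jar name, and the example paths are
placeholders, not anything from a real job. It saves typing the flags by hand,
but the arguments still end up on one command line, so it does nothing about
the length limit.

```python
# Sketch: build the -cacheFile arguments for a streaming job from a list of
# HDFS URIs. Function names and example values are hypothetical.
import posixpath


def cache_file_args(hdfs_uris):
    """Return ["-cacheFile", "<uri>#<linkname>", ...] for each URI.

    The symlink name is taken from the last path component (assumption:
    base names are unique across the list).
    """
    args = []
    for uri in hdfs_uris:
        link = posixpath.basename(uri)
        args += ["-cacheFile", "%s#%s" % (uri, link)]
    return args


def streaming_command(streaming_jar, hdfs_uris, extra_args):
    """Assemble the full argument list for the streaming job."""
    return (["hadoop", "jar", streaming_jar]
            + cache_file_args(hdfs_uris)
            + list(extra_args))
```

The resulting list could then be handed to something like subprocess.call()
instead of being pasted into a shell.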

Thoughts?

Thanks.

________________________________________________________________________________
Keith Wiley               [email protected]               www.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
  -- Abe (Grandpa) Simpson
________________________________________________________________________________


