I've seen this asked before, but haven't seen a response yet. If the input to a streaming job is not actual data splits but a list of HDFS file names, which the mappers then open and read themselves, how can data locality be achieved?
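To make the setup concrete, here is a minimal sketch of the kind of mapper I mean. It assumes the job input is a text file with one HDFS path per line and that the `hadoop` command is on the PATH of the task nodes. The function names and the byte-count output are hypothetical placeholders, not anything taken from a real job.

```python
#!/usr/bin/env python
"""Streaming mapper sketch: each stdin line is an HDFS path, not a data record."""
import subprocess
import sys


def parse_paths(lines):
    """Return the non-empty HDFS paths, one per input line."""
    return [line.strip() for line in lines if line.strip()]


def cat_command(path):
    """Build the shell command that streams one HDFS file to stdout."""
    return ["hadoop", "fs", "-cat", path]


def process(path, out=sys.stdout):
    """Read the whole file through `hadoop fs -cat` and emit path<TAB>size.

    The byte count stands in for whatever real work the mapper does.
    """
    proc = subprocess.Popen(cat_command(path), stdout=subprocess.PIPE)
    size = 0
    # Read in 1 MiB chunks so a large file is never held in memory at once.
    for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
        size += len(chunk)
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("hadoop fs -cat failed for %s" % path)
    out.write("%s\t%d\n" % (path, size))


def main(stream=sys.stdin):
    """Mapper entry point: treat every input line as a file to fetch."""
    for path in parse_paths(stream):
        process(path)
```

The real script would finish by calling `main()`. The point of the sketch is that the scheduler only sees the small list file as the job input, so a task is placed near a block of that list and not near the blocks of the files it later fetches.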
Likewise, is there an easier way to make those files accessible than the -cacheFile flag? That requires building a very long hadoop command (potentially hundreds of files), and I'm worried about exceeding some command-line length limit. It would also be nice to do this programmatically, say with DistributedCache.addCacheFile(), but that requires writing your own driver, which I don't see how to do with streaming.

Thoughts?

Thanks.

________________________________________________________________________________
Keith Wiley     [email protected]     www.keithwiley.com

"I used to be with it, but then they changed what it was. Now, what I'm with
isn't it, and what's it seems weird and scary to me."
                                           --  Abe (Grandpa) Simpson
________________________________________________________________________________
