On Feb 3, 2011, at 9:16 AM, Keith Wiley wrote:
> I've seen this asked before, but haven't seen a response yet.
>
> If the input to a streaming job is not actual data splits but simply HDFS
> file names which are then read by the mappers, then how can data locality
> be achieved?
If I understand your question, the method of processing doesn't matter.
The JobTracker places tasks based on input locality. So if you are providing
the names of the files you want as input via -input, then the JT will use the
locations of those files' blocks. (Remember: streaming.jar is basically a big
wrapper around the Java methods, and the parameters you pass to it are
essentially the same as you'd provide to a "real" Java app.)
Or are you saying your -input is a list of other files to read? In that
case, there is no locality. But again, streaming or otherwise makes no real
difference.
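For instance, a bare-bones streaming run looks something like this (the
paths and script names here are invented):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/kwiley/mydata \
    -output /user/kwiley/out \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -file mapper.py -file reducer.py
```

The JT computes splits from the blocks under /user/kwiley/mydata and tries
to schedule each map task on a node holding its block, exactly as it would
for a Java job.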
> Likewise, is there any easier way to make those files accessible other than
> using the -cacheFile flag?
> That requires building a very very long hadoop command (100s of files
> potentially). I'm worried about overstepping some command-line length
> limit...plus it would be nice to do this programmatically, say with the
> DistributedCache.addCacheFile() command, but that requires writing your own
> driver, which I don't see how to do with streaming.
>
> Thoughts?
I think you need to give a more concrete example of what you are doing.
-cache is used for shipping files with your job and has no bearing on what
your job's input is. Something tells me that you've cooked
something up that is overly complex. :D
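If it turns out you really do need to ship hundreds of side files, one way
to keep the command line short is to roll them into a single archive and
pass one -cacheArchive flag instead of hundreds of -cacheFile flags. A
sketch, with invented paths:

```shell
# Bundle the side files once and park the archive on HDFS.
tar czf sidefiles.tgz sidefiles/
hadoop fs -put sidefiles.tgz /user/kwiley/

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/kwiley/input \
    -output /user/kwiley/out \
    -mapper ./mapper.py -file mapper.py \
    -cacheArchive 'hdfs:///user/kwiley/sidefiles.tgz#side'
```

Each task then sees the unpacked contents under ./side in its working
directory. (And `getconf ARG_MAX` will tell you what the command-line limit
actually is on your system, though with a single flag you'll never get near
it.)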