On Feb 3, 2011, at 9:16 AM, Keith Wiley wrote:
> I've seen this asked before, but haven't seen a response yet.
>
> If the input to a streaming job is not actual data splits but simply HDFS
> file names which are then read by the mappers, then how can data locality
> be achieved?
If I understand your question, the method of processing doesn't matter.
The JobTracker places tasks based on input locality. So if you are providing
the names of the files you want as input via -input, then the JT will use the
locations of those files' blocks. (Remember: streaming.jar is basically a big
wrapper around the Java methods, and the parameters you pass to it are
essentially the same as you'd provide to a "real" Java app.)
Or are you saying your -input is a list of other files to read? In that
case, there is no locality. But again, streaming or otherwise makes no real
difference.
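For instance, a bare-bones streaming run looks something like this (the
paths and script names here are invented):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/kwiley/mydata \
    -output /user/kwiley/out \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -file mapper.py -file reducer.py
```

The JT computes splits from the blocks under /user/kwiley/mydata and tries
to schedule each map task on a node holding its block, exactly as it would
for a Java job.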
> Likewise, is there any easier way to make those files accessible other than
> using the -cacheFile flag?
> That requires building a very very long hadoop command (100s of files
> potentially). I'm worried about overstepping some command-line length
> limit...plus it would be nice to do this programmatically, say with the
> DistributedCache.addCacheFile() command, but that requires writing your own
> driver, which I don't see how to do with streaming.
>
> Thoughts?
I think you need to give a more concrete example of what you are doing.
-cache is used for shipping files with your job and has no bearing on what
your job's input is. Something tells me that you've cooked
something up that is overly complex. :D
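If it turns out you really do need to ship hundreds of side files, one way
to keep the command line short is to roll them into a single archive and
pass one -cacheArchive flag instead of hundreds of -cacheFile flags. A
sketch, with invented paths:

```shell
# Bundle the side files once and park the archive on HDFS.
tar czf sidefiles.tgz sidefiles/
hadoop fs -put sidefiles.tgz /user/kwiley/

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/kwiley/input \
    -output /user/kwiley/out \
    -mapper ./mapper.py -file mapper.py \
    -cacheArchive 'hdfs:///user/kwiley/sidefiles.tgz#side'
```

Each task then sees the unpacked contents under ./side in its working
directory. (And `getconf ARG_MAX` will tell you what the command-line limit
actually is on your system, though with a single flag you'll never get near
it.)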