On Feb 3, 2011, at 9:29 AM, David Rosenstrauch wrote:

> On 02/03/2011 12:16 PM, Keith Wiley wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> 
>> If the input to a streaming job is not actual data splits but simple
>> HDFS file names which are then read by the mappers, then how can data
>> locality be achieved?
>> 
>> Likewise, is there any easier way to make those files accessible
>> other than using the -cacheFile flag?  That requires building a very
>> very long hadoop command (100s of files potentially).  I'm worried
>> about overstepping some command-line length limit...plus it would be
>> nice to do this programmatically, say with the
>> DistributedCache.addCacheFile() command, but that requires writing
>> your own driver, which I don't see how to do with streaming.
>> 
>> Thoughts?
> 
> Submit the job in a Java app instead of via streaming?  Have a big loop where 
> you repeatedly call job.addInputPath.  (Or, if you're going to have a large 
> number of input files, use CombineFileInputFormat for efficiency.)
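A sketch of that "big loop," assuming the old mapred API of that era (untested; the paths and class name below are placeholders, not from the thread):

```java
// Untested sketch of the loop-over-input-paths approach using the old
// mapred API.  Paths and the driver class name are hypothetical.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ManyInputsDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ManyInputsDriver.class);
        conf.setJobName("many-inputs");
        FileSystem fs = FileSystem.get(conf);
        // Add every file under the input directory as a real input path,
        // so the framework computes splits -- and data locality -- normally,
        // instead of the mappers reading file names and fetching the files.
        for (FileStatus stat : fs.listStatus(new Path("/user/me/inputs"))) {
            FileInputFormat.addInputPath(conf, stat.getPath());
        }
        FileOutputFormat.setOutputPath(conf, new Path("/user/me/out"));
        JobClient.runJob(conf);
    }
}
```

With hundreds of files this creates at least one split per file; that is where CombineFileInputFormat earns its keep by packing many small files into fewer splits.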


Well, I know how to write a typical Hadoop driver which "extends Configured 
implements Tool" if that's what you mean, but then how do I kick off a 
streaming job from that driver?  I only know how to start a "normal" Java 
Hadoop job that way (via JobClient.runJob(conf);).  The only way I know to 
launch a streaming job is by running the streaming jar from the command 
line.

Does my question make sense?
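
One possible route, for what it's worth (untested, and the class names depend on the Hadoop version, so treat it strictly as a sketch): the streaming jar's entry point is org.apache.hadoop.streaming.StreamJob, which implements Tool, so a driver may be able to invoke it through ToolRunner with the same arguments the command line would take -- building the long -cacheFile list in a loop instead of on the shell:

```java
// Sketch only (not tested): drive hadoop-streaming from Java by invoking
// StreamJob, the class behind the streaming jar, via ToolRunner.
// Every path below is a hypothetical placeholder, not from the thread.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

public class StreamingDriver {
    public static void main(String[] args) throws Exception {
        List<String> argv = new ArrayList<String>();
        argv.add("-input");  argv.add("/user/me/filelist.txt");
        argv.add("-output"); argv.add("/user/me/out");
        argv.add("-mapper"); argv.add("mymapper.py");
        // Build the (potentially very long) -cacheFile list in code,
        // sidestepping any shell command-line length limit.
        for (int i = 0; i < 500; i++) {
            argv.add("-cacheFile");
            argv.add("hdfs://namenode/data/part-" + i + "#part-" + i);
        }
        int rc = ToolRunner.run(new Configuration(), new StreamJob(),
                                argv.toArray(new String[argv.size()]));
        System.exit(rc);
    }
}
```

This would need hadoop-streaming*.jar on the driver's classpath; whether StreamJob is meant to be called this way is not documented, so it may break across versions.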

________________________________________________________________________________
Keith Wiley     [email protected]     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________
