On Feb 3, 2011, at 9:29 AM, David Rosenstrauch wrote:

> On 02/03/2011 12:16 PM, Keith Wiley wrote:
>> I've seen this asked before, but haven't seen a response yet.
>>
>> If the input to a streaming job is not actual data splits but simple
>> HDFS file names which are then read by the mappers, then how can data
>> locality be achieved?
>>
>> Likewise, is there any easier way to make those files accessible
>> other than using the -cacheFile flag? That requires building a very,
>> very long hadoop command (hundreds of files, potentially). I'm worried
>> about overstepping some command-line length limit... plus it would be
>> nice to do this programmatically, say with the
>> DistributedCache.addCacheFile() command, but that requires writing
>> your own driver, which I don't see how to do with streaming.
>>
>> Thoughts?
>
> Submit the job in a Java app instead of via streaming? Have a big loop
> where you repeatedly call job.addInputPath. (Or, if you're going to have
> a large number of input files, use CombineFileInputFormat for efficiency.)
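As a rough sketch of the loop David describes, written against the old mapred API (class names, paths, and the job name below are illustrative placeholders, not from the thread):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ManyFilesDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ManyFilesDriver.class);
    conf.setJobName("many-files");  // placeholder job name

    // One addInputPath call per HDFS file; hundreds of iterations are
    // fine here, unlike hundreds of flags on a shell command line.
    for (String file : args) {
      FileInputFormat.addInputPath(conf, new Path(file));
    }
    FileOutputFormat.setOutputPath(conf, new Path("/user/keith/output"));

    JobClient.runJob(conf);
  }
}
```

Swapping in CombineFileInputFormat via conf.setInputFormat(...) would pack many small files into fewer splits, as David notes.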
Well, I know how to write a typical Hadoop driver which "extends Configured
implements Tool", if that's what you mean, but then how do I kick off a
streaming job from that driver? I only know how to start a "normal" Java
Hadoop job that way (via JobClient.runJob(conf);). How do I start a streaming
job using that method? So far I have only launched streaming jobs by running
the streaming jar from the command line. Does my question make sense?

________________________________________________________________________________
Keith Wiley        [email protected]        keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________
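One possible answer, as a minimal sketch: in the 0.20-era API, org.apache.hadoop.streaming.StreamJob implements Tool, so a driver can hand it the same flags the streaming jar accepts via ToolRunner. The input/output paths, mapper name, and the idea of taking the cache-file list from args are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

public class StreamingDriver {
  public static void main(String[] args) throws Exception {
    // Build the streaming arguments programmatically instead of on
    // the shell command line, sidestepping any length limit.
    List<String> argv = new ArrayList<String>();
    argv.add("-input");   argv.add("/user/keith/input");    // placeholder
    argv.add("-output");  argv.add("/user/keith/output");   // placeholder
    argv.add("-mapper");  argv.add("my_mapper.py");         // placeholder
    argv.add("-reducer"); argv.add("NONE");

    // One -cacheFile pair per HDFS file passed to this driver;
    // "#name" sets the symlink name in the task's working directory.
    for (String hdfsFile : args) {
      argv.add("-cacheFile");
      argv.add(hdfsFile + "#" + new Path(hdfsFile).getName());
    }

    int rc = ToolRunner.run(new StreamJob(),
                            argv.toArray(new String[argv.size()]));
    System.exit(rc);
  }
}
```

This needs hadoop-streaming.jar on the driver's classpath, and the exact class layout may differ between Hadoop releases, so treat it as a starting point rather than a drop-in solution.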
