Hadoop provides the DistributedCache API for this. See http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html
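
Roughly, something like this (untested sketch; /tmp/cache-* is an arbitrary scratch location of my choosing, and a random UUID stands in for the job ID, since the ID isn't assigned until submitJob() runs):

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.UUID;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

// localFile is the (hypothetical) local path of your serialized object.
void addToCache(JobConf conf, Path localFile)
    throws IOException, URISyntaxException {
  FileSystem fs = FileSystem.get(conf);
  // A random UUID stands in for the job ID, which is only assigned
  // during submitJob(), too late to name the upload after it.
  Path remote = new Path("/tmp/cache-" + UUID.randomUUID(), localFile.getName());
  fs.copyFromLocalFile(localFile, remote);
  // The "#name" fragment symlinks the cached file into each task's
  // working directory under a stable local name.
  DistributedCache.addCacheFile(
      new URI(remote.toUri() + "#" + localFile.getName()), conf);
  DistributedCache.createSymlink(conf);
}

Call this on m_JobConf before jobClient.submitJob( m_JobConf ) as before. One caveat: unlike the tmpfiles/staging route, nothing cleans up the /tmp/cache-* directory for you, so you'd want to delete it once the job completes.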
On Wed, Jul 11, 2012 at 9:15 AM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:
>
> Hi,
>
> I am using the programmatic call to initialize the Hadoop job
> ("jobClient.submitJob( m_JobConf )"). I need to put a big object in the
> distributed cache, so I serialize it and send it over. With the
> ToolRunner, I can use -file, the file is sent into the job
> directory, and different jobs have no conflict. However, there is no such
> mechanism for programmatic submission.
>
> Originally I just uploaded the file into HDFS and then added the HDFS path
> to the distributed cache. But to avoid conflicts between jobs, I would
> like to add the job ID as a prefix or suffix to the remote name; however, I
> cannot access the job ID until the submitJob() call, which is too late for
> uploading files to HDFS.
>
> Alternatively, after reading through the source code, I added the property
> "tmpfiles" to the JobConf object before the submitJob() call:
>
> conf.set( "tmpfiles", output.makeQualified( localFs ).toUri() + "#" + symlink );
>
> This seems to be the internal mechanism behind the "-file" option, but it
> feels very hacky. It would be nice if Hadoop provided a more formal way to
> handle this. Thanks.
>
> BTW: I am using 0.20.2 (CDH3u3)
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac

--
Harsh J