Hadoop provides the DistributedCache API for this. See
http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html
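
A minimal sketch of how that looks against the old mapred API in 0.20.x
(the class name, paths, and file names below are made up for illustration;
the "#fragment" part names the symlink each task sees in its working
directory):

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CacheSubmitSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheSubmitSketch.class);
        FileSystem fs = FileSystem.get(conf);

        // Upload the serialized object to HDFS (paths are illustrative).
        Path local = new Path("/tmp/bigobject.ser");
        Path remote = new Path("/user/zhu/cache/bigobject.ser");
        fs.copyFromLocalFile(local, remote);

        // Register it with the distributed cache; the "#" fragment names
        // the symlink that appears in each task's working directory.
        DistributedCache.addCacheFile(
            new URI(remote.makeQualified(fs).toUri() + "#bigobject.ser"), conf);
        DistributedCache.createSymlink(conf);

        RunningJob job = new JobClient(conf).submitJob(conf);
        job.waitForCompletion();
      }
    }

Tasks can then open the file simply as new FileInputStream("bigobject.ser"),
or look it up via DistributedCache.getLocalCacheFiles(conf).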

On Wed, Jul 11, 2012 at 9:15 AM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:
>
> Hi,
>
> I am using a programmatic call to initialize the Hadoop job
> ("jobClient.submitJob( m_JobConf )").  I need to put a big object into the
> distributed cache, so I serialize it and send it over.  With the
> ToolRunner, I can use -file: the file is sent into the job directory, and
> different jobs have no conflict.  However, there is no such mechanism for
> programmatic submission.
>
> I originally just uploaded the file into HDFS and then added the HDFS
> address to the distributed cache.  But to avoid conflicts between
> concurrent jobs, I would like to add the jobID as a prefix or suffix to
> the remote name; however, I cannot access the jobID until the submitJob()
> call, which is too late for uploading files to HDFS.
>
> Alternatively, after reading through the source code, I added the property
> "tmpfiles" to the jobConf object before the submitJob() call:
>  conf.set( "tmpfiles", output.makeQualified( localFs ).toUri() + "#" +
> symlink );
> This seems to be the internal mechanism behind the "-file" option, but it
> feels very hacky.  It would be nice if Hadoop provided a more formal way
> to handle this.  Thanks.
>
> BTW: I am using Hadoop 0.20.2 (CDH3u3).
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac
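
On the jobID timing problem described above: one workaround (just a sketch,
reusing fs and conf from the snippet up top, plus java.util.UUID) is to key
the staging path off a client-generated UUID rather than the job ID:

    // Variation on the sketch above: key the staging path off a client-side
    // UUID so concurrent submissions never collide, and clean up afterwards.
    Path staging = new Path("/user/zhu/cache/" + UUID.randomUUID());
    Path remote = new Path(staging, "bigobject.ser");
    fs.copyFromLocalFile(new Path("/tmp/bigobject.ser"), remote);
    fs.deleteOnExit(staging); // best-effort cleanup if the client dies early

    DistributedCache.addCacheFile(
        new URI(remote.makeQualified(fs).toUri() + "#bigobject.ser"), conf);
    DistributedCache.createSymlink(conf);

    RunningJob job = new JobClient(conf).submitJob(conf);
    job.waitForCompletion();
    fs.delete(staging, true); // explicit cleanup once the job finishes

Since the UUID is generated before submission, the remote name is unique per
submission without ever needing the jobID.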



-- 
Harsh J
