This is a tardy response; I'm spread pretty thin right now.

DistributedCache
<http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
is apparently deprecated. Is there a replacement? I didn't see anything
about this in the documentation, but then I am still using 0.21.0. I have
to for performance reasons: 1.0.1 is too slow and the client won't have
it.
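
For context, the driver-side usage I would be giving up is the familiar
static-call pattern, roughly like this (the paths are hypothetical; I'm
only showing the calls involved):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship a jar so it is added to each task's classpath.
        DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), conf);
        // Ship a plain file; tasks read it from their local cache directory.
        DistributedCache.addCacheFile(new URI("/data/lookup.dat"), conf);
        // ...then build and submit the job with this conf.
      }
    }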

Also, the DistributedCache
<http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
approach seems to work only from within a Hadoop job, i.e. from within a
Mapper or a Reducer, but not from within a Driver. I have libraries that I
must access from both places. I take it that I am stuck keeping two
copies of these libraries in sync. Correct? It's either that, or copy
them into HDFS, replacing them all at the beginning of each job run.
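
To illustrate the asymmetry, inside a task the cached files are easy to
reach, something like this sketch (new-API Mapper; the details are
hypothetical, the point is where the cache is visible):

    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Path[] localFiles;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        // Node-local copies of whatever the driver staged in the cache.
        localFiles = DistributedCache.getLocalCacheFiles(
            context.getConfiguration());
      }
      // map() omitted; the cached files are only resolvable task-side.
    }

In the driver JVM there is no task-local cache to resolve against, so the
same libraries still have to sit on the client classpath as ordinary jars.
That is the duplication I am asking about.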

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley <omal...@apache.org> wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> <geoffry.robe...@gmail.com> wrote:
>
> > If I create an executable jar file that contains all dependencies
> required
> > by the MR job do all said dependencies get distributed to all nodes?
>
> You can make a single jar and that will be distributed to all of the
> machines that run the task, but it is better in most cases to use the
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>
> > If I specify but one reducer, which node in the cluster will the reducer
> > run on?
>
> The scheduling is done by the JobTracker and it isn't possible to
> control the location of the reducers.
>
> -- Owen
>



-- 
Geoffry Roberts
