This is a tardy response. I'm spread pretty thinly right now. DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache> is apparently deprecated. Is there a replacement? I didn't see anything about this in the documentation, but then I am still using 0.21.0. I have to for performance reasons. 1.0.1 is too slow and the client won't have it.
Also, the DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache> approach seems to work only from within a Hadoop job, i.e. from within a Mapper or a Reducer, but not from within a Driver. I have libraries that I must access from both places. I take it that I am stuck keeping two copies of these libraries in sync -- correct? It's either that, or copy them into HDFS, replacing them all at the beginning of each job run. Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley <omal...@apache.org> wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> <geoffry.robe...@gmail.com> wrote:
>
> > If I create an executable jar file that contains all dependencies required
> > by the MR job do all said dependencies get distributed to all nodes?
>
> You can make a single jar and that will be distributed to all of the
> machines that run the task, but it is better in most cases to use the
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>
> > If I specify but one reducer, which node in the cluster will the reducer
> > run on?
>
> The scheduling is done by the JobTracker and it isn't possible to
> control the location of the reducers.
>
> -- Owen

--
Geoffry Roberts
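For what it's worth, the "copy into HDFS at the start of each run, then register with the cache" pattern I'm describing looks roughly like this in a driver. This is only a sketch against the old org.apache.hadoop.filecache API (the one from the r1.0.0 tutorial page above); the paths and class name are hypothetical placeholders, and it obviously needs a live cluster to actually run:

```java
// Sketch: push a shared library jar into HDFS from the driver, then
// register it with the DistributedCache so map/reduce tasks see it too.
// All paths below are hypothetical placeholders, not from the thread.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path localJar = new Path("/local/lib/mylib.jar");   // hypothetical
        Path hdfsJar  = new Path("/cache/lib/mylib.jar");   // hypothetical

        // Replace the copy in HDFS at the beginning of the job run
        // (delSrc=false, overwrite=true).
        fs.copyFromLocalFile(false, true, localJar, hdfsJar);

        // Add the jar to the task classpath via the distributed cache.
        DistributedCache.addFileToClassPath(hdfsJar, conf);

        // ...configure and submit the job with this conf as usual.
    }
}
```

The driver itself still loads the library from its own local classpath, which is exactly the two-copies problem: the cache only affects task JVMs, not the client JVM that submits the job.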