Thanks, Leo. I appreciate your response. Let me explain my situation more precisely.
I am running a series of MR sub-jobs all harnessed together so they run as a single job. The last MR sub-job does nothing more than aggregate the output of the previous sub-job into a single file (or files). It does this by having but a single reducer.

I could eliminate this aggregation sub-job if I could have the aforementioned previous sub-job insert its output into a database instead of HDFS. Doing this would also eliminate my current dependence on MultipleOutputs. The trouble comes when the Reducer(s) cannot find the classes for the persistent objects, hence the dreaded ClassNotFoundException (CNFE). I find this odd because they are in the same package as the Reducer.

Your comment about the back end crying is duly noted. By the way, MPI = Message Passing Interface?
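To make the CNFE concrete, here is roughly what I am attempting. I have trimmed it down and changed the names, so please read it as a sketch of the idea rather than my actual code: AggregateDriver, SummaryReducer, EventSummary, and the jar paths in HDFS are all placeholders, as are the key/value types. The intent is to push the jar holding the persistent classes (plus Hibernate itself) onto the task classpath via the DistributedCache, then have the reducer open a Hibernate session in setup() and save each aggregated record instead of writing to HDFS.

// --- Driver (separate file, sketch) ---
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class AggregateDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The jars must already sit in HDFS at these (placeholder) paths;
    // addFileToClassPath adds them to every task's classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/my-entities.jar"), conf);
    DistributedCache.addFileToClassPath(new Path("/libs/hibernate3.jar"), conf);

    Job job = new Job(conf, "aggregate-to-db");
    job.setJarByClass(AggregateDriver.class);
    job.setReducerClass(SummaryReducer.class);
    // ... mapper, input/output formats, and paths set as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

// --- Reducer (separate file, sketch) ---
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class SummaryReducer
    extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

  private SessionFactory sessionFactory;
  private Session session;

  @Override
  protected void setup(Context context) {
    // hibernate.cfg.xml must also be visible on the task classpath.
    sessionFactory = new org.hibernate.cfg.Configuration().configure().buildSessionFactory();
    session = sessionFactory.openSession();
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable v : values) {
      total += v.get();
    }
    // Insert straight into the database instead of emitting to HDFS.
    Transaction tx = session.beginTransaction();
    session.save(new EventSummary(key.toString(), total));
    tx.commit();
  }

  @Override
  protected void cleanup(Context context) {
    if (session != null) session.close();
    if (sessionFactory != null) sessionFactory.close();
  }
}

// --- Placeholder persistent class (mapped via hibernate.cfg.xml / EventSummary.hbm.xml) ---
public class EventSummary {
  private Long id;           // surrogate key assigned by Hibernate
  private String name;
  private long total;

  public EventSummary() { }  // Hibernate needs a no-arg constructor
  public EventSummary(String name, long total) { this.name = name; this.total = total; }
  // getters and setters omitted in this sketch
}

The addFileToClassPath calls reflect my reading of how to get the entity jars in front of the task JVMs; as I understand it, bundling them in a lib/ directory inside the job jar should accomplish the same thing. I am also opening one transaction per key purely for simplicity; batching the inserts would obviously be kinder to the database, per your warning.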
On 2 March 2012 10:30, Leo Leung <lle...@ddn.com> wrote:

> Geoffry,
>
> Hadoop DistributedCache (as of now) is used to "cache" M/R application-specific
> files. These files are used by the M/R app only and not the framework.
> (Normally as a side lookup.)
>
> You can certainly try to use Hibernate to query your SQL-based back-end
> within the M/R code. But think of what happens when a few hundred or
> thousands of M/R tasks do that concurrently. Your back-end is going to cry.
> (If it can - before it dies.)
>
> So IMO, prepping your M/R job with DistributedCache files (pull them down
> first) is a better approach.
>
> Also, MPI is pretty much out of the question (not baked into the framework).
> You'll likely have to roll your own. (And try to trick the JobTracker into
> not starting the same task.)
>
> Does anyone have a better solution for Geoffry?
>
>
> -----Original Message-----
> From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com]
> Sent: Friday, March 02, 2012 9:42 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop and Hibernate
>
> This is a tardy response. I'm spread pretty thinly right now.
>
> DistributedCache
> <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
> is apparently deprecated. Is there a replacement? I didn't see anything about
> this in the documentation, but then I am still using 0.21.0. I have to for
> performance reasons. 1.0.1 is too slow and the client won't have it.
>
> Also, the DistributedCache
> <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
> approach seems only to work from within a Hadoop job, i.e. from within a
> Mapper or a Reducer, but not from within a Driver. I have libraries that I
> must access from both places. I take it that I am stuck keeping two copies of
> these libraries in sync--correct? It's either that, or copy them into HDFS,
> replacing them all at the beginning of each job run.
>
> Looking for best practices.
>
> Thanks
>
> On 28 February 2012 10:17, Owen O'Malley <omal...@apache.org> wrote:
>
> > On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> > <geoffry.robe...@gmail.com> wrote:
> >
> > > If I create an executable jar file that contains all dependencies
> > > required by the MR job, do all said dependencies get distributed to
> > > all nodes?
> >
> > You can make a single jar and that will be distributed to all of the
> > machines that run the task, but it is better in most cases to use the
> > distributed cache.
> >
> > See
> > http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >
> > > If I specify but one reducer, which node in the cluster will the
> > > reducer run on?
> >
> > The scheduling is done by the JobTracker and it isn't possible to
> > control the location of the reducers.
> >
> > -- Owen
>
> --
> Geoffry Roberts

--
Geoffry Roberts