Thanks Leo.  I appreciate your response.

Let me explain my situation more precisely.

I am running a series of MR sub-jobs all harnessed together so they run as
a single job.  The last MR sub-job does nothing more than aggregate the
output of the previous sub-job into a single file (or small set of files).
It does this by having but a single reducer.  I could eliminate this
aggregation sub-job if I could have the aforementioned previous sub-job
insert its output into a database instead of HDFS.  Doing this would also
eliminate my current dependence on MultipleOutputs.
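
To make that concrete, what I have in mind is roughly the sketch below.
It is only a sketch: ResultRow is a placeholder for one of my mapped
persistent classes, and it assumes hibernate.cfg.xml and the Hibernate
jars are visible on the task classpath (which is exactly where things
fall over today).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;

    public class DbExportReducer
        extends Reducer<Text, IntWritable, NullWritable, NullWritable> {

      private SessionFactory sessionFactory;
      private Session session;

      @Override
      protected void setup(Context context) {
        // One SessionFactory/Session per reducer task, not per record.
        sessionFactory = new org.hibernate.cfg.Configuration()
            .configure().buildSessionFactory();
        session = sessionFactory.openSession();
      }

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        long total = 0;
        for (IntWritable v : values) {
          total += v.get();
        }
        // Write straight to the database instead of emitting to HDFS.
        session.beginTransaction();
        session.save(new ResultRow(key.toString(), total));
        session.getTransaction().commit();
      }

      @Override
      protected void cleanup(Context context) {
        session.close();
        sessionFactory.close();
      }
    }

Committing per key like this will be slow, and your point about a few
hundred tasks doing it concurrently applies in full; batching the saves
would be the first thing I tune.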

The trouble comes when the Reducer(s) cannot find the persistent classes,
hence the dreaded CNFE (ClassNotFoundException).  I find this odd because
they are in the same package as the Reducer.
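
One guess is that only part of what the tasks need (the Hibernate jars
themselves, or the jar holding the mapped classes) actually reaches the
task classpath.  What I plan to try next, in the driver before submitting
the job, is something along these lines (untested; the HDFS paths are
examples only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();

    // Ship the jar holding the persistent classes (and Hibernate itself)
    // to every task and add it to the task classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/my-entities.jar"), conf);
    DistributedCache.addFileToClassPath(new Path("/libs/hibernate3.jar"), conf);

    Job job = new Job(conf, "export-to-db");

Packaging the same jars under lib/ inside the job jar, or passing them
with -libjars, should amount to the same thing.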

Your comment about the back end crying is duly noted.

btw,
MPI = Message Passing Interface?

On 2 March 2012 10:30, Leo Leung
 <lle...@ddn.com> wrote:

> Geoffry,
>
>  Hadoop DistributedCache (as of now) is used to "cache" M/R
> application-specific files.
>  These files are used by the M/R app only, not by the framework (normally
> as a side-lookup).
>
>  You can certainly try to use Hibernate to query your SQL-based back end
> within the M/R code.
>  But think of what happens when a few hundred or a few thousand M/R tasks
> do that concurrently.
>  Your back end is going to cry (if it can, before it dies).
>
>  So IMO, prepping your M/R job with DistributedCache files (pull the data
> down first) is the better approach.
>
>  Also, MPI is pretty much out of the question (not baked into the
> framework).
>  You'll likely have to roll your own.  (And try to trick the JobTracker
> into not starting the same task.)
>
>  Does anyone have a better solution for Geoffry?
>
>
>
> -----Original Message-----
> From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com]
> Sent: Friday, March 02, 2012 9:42 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop and Hibernate
>
> This is a tardy response.  I'm spread pretty thinly right now.
>
> DistributedCache
> (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
> is apparently deprecated.  Is there a replacement?  I didn't see anything
> about this in the documentation, but then I am still using 0.21.0.  I have
> to for performance reasons; 1.0.1 is too slow and the client won't have it.
>
> Also, the DistributedCache
> (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
> approach seems only to work from within a Hadoop job, i.e. from within a
> Mapper or a Reducer, but not from within a Driver.  I have libraries that
> I must access from both places.  I take it that I am stuck keeping two
> copies of these libraries in sync -- correct?  It's either that, or copy
> them into HDFS, replacing them all at the beginning of each job run.
>
> Looking for best practices.
>
> Thanks
>
> On 28 February 2012 10:17, Owen O'Malley <omal...@apache.org> wrote:
>
> > On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> > <geoffry.robe...@gmail.com> wrote:
> >
> > > If I create an executable jar file that contains all dependencies
> > > required by the MR job, do all said dependencies get distributed to
> > > all nodes?
> >
> > You can make a single jar and that will be distributed to all of the
> > machines that run the task, but it is better in most cases to use the
> > distributed cache.
> >
> > See
> > http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >
> > > If I specify but one reducer, which node in the cluster will the
> > > reducer run on?
> >
> > The scheduling is done by the JobTracker and it isn't possible to
> > control the location of the reducers.
> >
> > -- Owen
> >
>
>
>
> --
> Geoffry Roberts
>



-- 
Geoffry Roberts
