Thanks for your replies. Yeah, I had to restructure part of my code, but it is all good now. Thanks again for your suggestions.
Renato M.

2010/5/11 Jay Booth <[email protected]>

> Probably the most direct route to get your desired result is to save
> the objects to either a SequenceFile or plain text file on DFS. Then
> in the configure() section of your MapReduce jobs, you open the file
> on DFS, stream the contents into a local variable, and refer to it as
> you need to. Either way, you'll need some sort of serialization via
> Writable or plain text.
>
> On Tue, May 11, 2010 at 4:19 PM, Renato Marroquín Mogrovejo
> <[email protected]> wrote:
> > Hi Aaron,
> >
> > The thing is that I have a data structure that is saved into a
> > vector, and this vector needs to be available to my MapReduce jobs
> > while iterating. So do you think serializing these objects would be
> > a good and easy way to do it? It's a vector in which each node
> > contains another user-defined data structure. Maybe I will try to do
> > it first just using files, and see how the throughput goes.
> > Hey, do you know where I can find some examples of serializing
> > objects for Hadoop to save them into SequenceFiles?
> > Thanks in advance.
> >
> > Renato M.
> >
> > 2010/5/11 Aaron Kimball <[email protected]>
> >
> >> Perhaps this is guidance in the area you were hoping for: if your
> >> data is in objects that implement the interface 'Writable', then
> >> you can use the SequenceFileOutputFormat and SequenceFileInputFormat
> >> to store your intermediate data in binary form in disk-backed files
> >> called SequenceFiles. The serialization will happen through the
> >> write() and readFields() methods of your objects, which will
> >> automatically be called by the OutputFormat/InputFormat as they move
> >> through the system. So your subsequent MR pass will receive objects
> >> back in the same form as they were emitted. This is a considerably
> >> better idea (from both a throughput and a sanity perspective) in a
> >> chained MapReduce job.
> >>
> >> - Aaron
> >>
> >> On Tue, May 11, 2010 at 10:31 AM, Aaron Kimball <[email protected]>
> >> wrote:
> >>
> >> > What objects are you referring to? I'm not sure I understand your
> >> > question.
> >> > - Aaron
> >> >
> >> > On Tue, May 11, 2010 at 6:38 AM, Renato Marroquín Mogrovejo
> >> > <[email protected]> wrote:
> >> >
> >> >> Thanks Aaron! I was thinking the same after doing some reading.
> >> >> Man, what about serializing the objects? Do you think that is a
> >> >> good idea?
> >> >> Thanks again.
> >> >>
> >> >> Renato M.
> >> >>
> >> >> 2010/5/5 Aaron Kimball <[email protected]>
> >> >>
> >> >> > Renato,
> >> >> >
> >> >> > In general, if you need to perform a multi-pass MapReduce
> >> >> > workflow, each pass materializes its output to files. The
> >> >> > subsequent pass then reads those same files back in as input.
> >> >> > This allows the workflow to start at the last "checkpoint" if
> >> >> > it gets interrupted. There is no persistent in-memory
> >> >> > distributed storage feature in Hadoop that would allow a
> >> >> > MapReduce job to post results to memory for consumption by a
> >> >> > subsequent job.
> >> >> >
> >> >> > So you would just read your initial data from /input, and write
> >> >> > your interim results to /iteration0. Then the next pass reads
> >> >> > from /iteration0 and writes to /iteration1, etc.
> >> >> >
> >> >> > If your data is reasonably small and you think it could fit in
> >> >> > memory somewhere, then you could experiment with using other
> >> >> > distributed key-value stores (memcached, HBase, Cassandra,
> >> >> > etc.) to hold intermediate results. But this will require some
> >> >> > integration work on your part.
> >> >> > - Aaron
> >> >> >
> >> >> > On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo
> >> >> > <[email protected]> wrote:
> >> >> >
> >> >> > > Hi everyone, I have recently started to play around with
> >> >> > > Hadoop, but I am running into some "design" problems.
> >> >> > > I need to make a loop to execute the same job several times,
> >> >> > > and in each iteration get the processed values (not using a
> >> >> > > file, because I would then need to read it back). I was using
> >> >> > > a static vector in my main class (the one that iterates and
> >> >> > > executes the job in each iteration) to retrieve those values,
> >> >> > > and it did work while I was running in standalone mode. Now I
> >> >> > > have tried to test it in pseudo-distributed mode and
> >> >> > > obviously it is not working.
> >> >> > > Any suggestions, please?
> >> >> > >
> >> >> > > Thanks in advance,
> >> >> > >
> >> >> > > Renato M.
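P.S. A minimal sketch of the Writable/SequenceFile approach Aaron
describes above, for anyone who lands here looking for the example
Renato asked about. NodeData and NodeVector are hypothetical names
standing in for the vector of user-defined structures discussed in the
thread; only the Writable contract (a public no-arg constructor plus
write()/readFields()) comes from the Hadoop API, and in real code each
public class would live in its own file.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Writable;

    // Hypothetical stand-in for the user-defined structure held in
    // each node of the vector.
    public class NodeData implements Writable {
      private int id;
      private double score;

      public NodeData() {}  // Writables need a public no-arg constructor

      public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeDouble(score);
      }

      public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        score = in.readDouble();
      }
    }

    // The vector itself is also a Writable: a length prefix followed
    // by the serialized elements, restored in the same order.
    public class NodeVector implements Writable {
      private List<NodeData> nodes = new ArrayList<NodeData>();

      public void write(DataOutput out) throws IOException {
        out.writeInt(nodes.size());
        for (NodeData n : nodes) {
          n.write(out);
        }
      }

      public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        nodes.clear();
        for (int i = 0; i < size; i++) {
          NodeData n = new NodeData();
          n.readFields(in);
          nodes.add(n);
        }
      }
    }

Writing such a vector out to DFS would then look something like this
(the path and the IntWritable key are illustrative, not required by
the API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/iteration0/vector.seq");
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, NodeVector.class);
    writer.append(new IntWritable(0), vector);  // 'vector' is a NodeVector
    writer.close();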

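A sketch of the chained driver Aaron outlines, using the old
org.apache.hadoop.mapred API that was current at the time. The
iteration count, the path layout, and MyMapper/MyReducer are all
illustrative placeholders, not part of any Hadoop API:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        int iterations = 5;        // illustrative; could also test convergence
        String input = "/input";   // path layout follows Aaron's example
        for (int i = 0; i < iterations; i++) {
          JobConf job = new JobConf(IterativeDriver.class);
          job.setJobName("iteration-" + i);
          job.setMapperClass(MyMapper.class);    // hypothetical mapper
          job.setReducerClass(MyReducer.class);  // hypothetical reducer
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.setInputPaths(job, new Path(input));
          String output = "/iteration" + i;
          FileOutputFormat.setOutputPath(job, new Path(output));
          JobClient.runJob(job);  // blocks until this pass completes
          input = output;         // the next pass reads this pass's output
        }
      }
    }

Jay's configure() pattern plugs into the same job: each map task
re-opens the shared file on DFS and streams it into memory before any
map() call runs. Again, the path and the record types are assumptions
made for illustration:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private NodeVector vector = new NodeVector();

      // Runs once per task, before any map() call: load the shared
      // vector from the SequenceFile written by the previous pass.
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          Path path = new Path("/iteration0/vector.seq");  // illustrative
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, job);
          reader.next(new IntWritable(), vector);  // stream into memory
          reader.close();
        } catch (IOException e) {
          throw new RuntimeException("could not load side data", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // ... consult 'vector' while processing each record ...
      }
    }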