Hi Aaron,

The thing is that I have a data structure that is saved into a vector, and this vector needs to be available to my MapReduce jobs while iterating. So do you think serializing these objects would be a good and easy way to do it? It's a vector in which each node contains another user-defined data structure. Maybe I will first try it just using files and see how the throughput goes.
Hey, do you know where I can find some examples of serializing objects for Hadoop so that I can save them into SequenceFiles? Thanks in advance.
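From the reading I've done so far, I think the Writable would look something like this. It's only a sketch based on my own assumptions; NodeData and its fields are just stand-ins for my actual user-defined structure, so please correct me if I'm off:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;

// Placeholder for the user-defined structure held in each node of my vector.
class NodeData implements Writable {
  private int id;          // example field
  private double score;    // example field

  public void write(DataOutput out) throws IOException {
    out.writeInt(id);
    out.writeDouble(score);
  }

  public void readFields(DataInput in) throws IOException {
    id = in.readInt();
    score = in.readDouble();
  }
}

// The vector itself: serialized as an element count followed by each element,
// so readFields() knows how many entries to read back.
public class NodeVectorWritable implements Writable {
  private List<NodeData> nodes = new ArrayList<NodeData>();

  public void write(DataOutput out) throws IOException {
    out.writeInt(nodes.size());
    for (NodeData node : nodes) {
      node.write(out);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    nodes = new ArrayList<NodeData>(size);
    for (int i = 0; i < size; i++) {
      NodeData node = new NodeData();
      node.readFields(in);
      nodes.add(node);
    }
  }
}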
Renato M.

2010/5/11 Aaron Kimball <[email protected]>

> Perhaps this is guidance in the area you were hoping for: if your data is
> in objects that implement the interface 'Writable', then you can use the
> SequenceFileOutputFormat and SequenceFileInputFormat to store your
> intermediate data in binary form in disk-backed files called SequenceFiles.
> The serialization will happen through the write() and readFields() methods
> of your objects, which will automatically be called by the
> OutputFormat/InputFormat as they move through the system. So your
> subsequent MR pass will receive objects back in the same form as they were
> emitted. This is a considerably better idea (from both a throughput and a
> sanity perspective) in a chained MapReduce job.
>
> - Aaron
>
> On Tue, May 11, 2010 at 10:31 AM, Aaron Kimball <[email protected]> wrote:
>
> > What objects are you referring to? I'm not sure I understand your
> > question.
> > - Aaron
> >
> > On Tue, May 11, 2010 at 6:38 AM, Renato Marroquín Mogrovejo <
> > [email protected]> wrote:
> >
> > > Thanks Aaron! I was thinking the same after doing some reading.
> > > Man, what about serializing the objects? Do you think that is a good
> > > idea? Thanks again.
> > >
> > > Renato M.
> > >
> > > 2010/5/5 Aaron Kimball <[email protected]>
> > >
> > > > Renato,
> > > >
> > > > In general, if you need to perform a multi-pass MapReduce workflow,
> > > > each pass materializes its output to files. The subsequent pass then
> > > > reads those same files back in as input. This allows the workflow to
> > > > restart at the last "checkpoint" if it gets interrupted. There is no
> > > > persistent in-memory distributed storage feature in Hadoop that
> > > > would allow a MapReduce job to post results to memory for
> > > > consumption by a subsequent job.
> > > >
> > > > So you would just read your initial data from /input and write your
> > > > interim results to /iteration0. Then the next pass reads from
> > > > /iteration0 and writes to /iteration1, and so on.
> > > >
> > > > If your data is reasonably small and you think it could fit in
> > > > memory somewhere, then you could experiment with using other
> > > > distributed key-value stores (memcached[b], HBase, Cassandra, etc.)
> > > > to hold intermediate results. But this will require some integration
> > > > work on your part.
> > > > - Aaron
> > > >
> > > > On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo <
> > > > [email protected]> wrote:
> > > >
> > > > > Hi everyone, I have recently started to play around with Hadoop,
> > > > > but I am running into some "design" problems.
> > > > > I need to make a loop that executes the same job several times
> > > > > and, in each iteration, get the processed values back (not via a
> > > > > file, because I would then need to read it). I was using a static
> > > > > vector in my main class (the one that iterates and executes the
> > > > > job in each iteration) to retrieve those values, and it did work
> > > > > while I was using standalone mode. Now I tried to test it in
> > > > > pseudo-distributed mode and it is obviously not working.
> > > > > Any suggestions, please?
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Renato M.
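P.S. For the chained passes, this is how I currently understand the setup you describe above: a driver loop that writes each pass's output to /iterationN as SequenceFiles and reads it back in the next pass. Again just a sketch under my own assumptions; IterativeDriver is a made-up name, NodeVectorWritable is the Writable from my sketch above, and the commented-out mapper/reducer lines stand in for my real classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String input = "/input";      // initial input, read by the first pass
    int numIterations = 3;        // however many passes the workflow needs

    for (int i = 0; i < numIterations; i++) {
      Job job = new Job(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // job.setMapperClass(MyMapper.class);    // my own mapper would go here
      // job.setReducerClass(MyReducer.class);  // my own reducer would go here
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NodeVectorWritable.class);  // the Writable above

      if (i > 0) {
        // Passes after the first read back the SequenceFiles written by the
        // previous pass.
        job.setInputFormatClass(SequenceFileInputFormat.class);
      }
      // Every pass writes SequenceFiles, so write()/readFields() handle the
      // serialization instead of text formatting.
      job.setOutputFormatClass(SequenceFileOutputFormat.class);

      FileInputFormat.addInputPath(job, new Path(input));
      Path output = new Path("/iteration" + i);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        System.exit(1);           // stop the chain if a pass fails
      }
      input = output.toString();  // next pass reads this pass's output
    }
  }
}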
