Perhaps this is guidance in the area you were hoping for: if your data is in objects that implement the Writable interface, then you can use SequenceFileOutputFormat and SequenceFileInputFormat to store your intermediate data in binary form in disk-backed files called SequenceFiles. Serialization happens through your objects' write() and readFields() methods, which the OutputFormat/InputFormat call automatically as the data moves through the system. So your subsequent MR pass will receive the objects back in exactly the same form as they were emitted. This is considerably better (from both a throughput and a sanity perspective) than writing text output and re-parsing it between passes of a chained MapReduce job.
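
To make that concrete, here is a rough sketch using the new (org.apache.hadoop.mapreduce) API. The MyRecord class, the paths, and the job names are just placeholders; you'd plug in your own mapper/reducer that emit (Text, MyRecord) pairs:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileChainSketch {

  // Placeholder value type. The framework calls write()/readFields() for you
  // whenever it serializes this object, including into/out of a SequenceFile.
  public static class MyRecord implements Writable {
    private long count;
    private Text label = new Text();

    public void write(DataOutput out) throws IOException {
      out.writeLong(count);
      label.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      count = in.readLong();
      label.readFields(in);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path interim = new Path("/interim");   // placeholder intermediate directory

    // Pass 1: your mapper/reducer emit (Text, MyRecord) pairs, which are
    // stored in binary form as a SequenceFile.
    Job pass1 = new Job(conf, "pass 1");
    // pass1.setMapperClass(...); pass1.setReducerClass(...);
    pass1.setOutputKeyClass(Text.class);
    pass1.setOutputValueClass(MyRecord.class);
    pass1.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(pass1, new Path("/input"));
    FileOutputFormat.setOutputPath(pass1, interim);
    pass1.waitForCompletion(true);

    // Pass 2: the same MyRecord objects come back fully deserialized;
    // no text parsing in your second mapper.
    Job pass2 = new Job(conf, "pass 2");
    // pass2.setMapperClass(...); pass2.setReducerClass(...);
    pass2.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(pass2, interim);
    FileOutputFormat.setOutputPath(pass2, new Path("/final"));
    pass2.waitForCompletion(true);
  }
}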
- Aaron

On Tue, May 11, 2010 at 10:31 AM, Aaron Kimball <[email protected]> wrote:

> What objects are you referring to? I'm not sure I understand your question.
> - Aaron
>
> On Tue, May 11, 2010 at 6:38 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> Thanks Aaron! I was thinking the same after doing some reading.
>> What about serializing the objects? Do you think that would be a good idea?
>> Thanks again.
>>
>> Renato M.
>>
>> 2010/5/5 Aaron Kimball <[email protected]>
>>
>> > Renato,
>> >
>> > In general, if you need to perform a multi-pass MapReduce workflow, each
>> > pass materializes its output to files. The subsequent pass then reads
>> > those same files back in as input. This allows the workflow to restart
>> > from the last "checkpoint" if it gets interrupted. There is no persistent
>> > in-memory distributed storage feature in Hadoop that would allow a
>> > MapReduce job to post results to memory for consumption by a subsequent
>> > job.
>> >
>> > So you would just read your initial data from /input and write your
>> > interim results to /iteration0. Then the next pass reads from /iteration0
>> > and writes to /iteration1, etc.
>> >
>> > If your data is reasonably small and you think it could fit in memory
>> > somewhere, then you could experiment with using other distributed
>> > key-value stores (memcached[b], hbase, cassandra, etc.) to hold
>> > intermediate results. But this will require some integration work on
>> > your part.
>> > - Aaron
>> >
>> > On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo <
>> > [email protected]> wrote:
>> >
>> > > Hi everyone, I have recently started to play around with Hadoop, but I
>> > > am running into some "design" problems.
>> > > I need to make a loop that executes the same job several times and, in
>> > > each iteration, get the processed values back (not through a file,
>> > > because then I would need to read it). I was using a static vector in
>> > > my main class (the one that iterates and executes the job in each
>> > > iteration) to retrieve those values, and it worked while I was running
>> > > in standalone mode. Now I have tried it in pseudo-distributed mode and,
>> > > obviously, it is not working.
>> > > Any suggestions, please?
>> > >
>> > > Thanks in advance,
>> > >
>> > > Renato M.
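
P.S. For anyone finding this thread in the archives later, here is a bare-bones sketch of the checkpointed driver loop described in the quoted messages above. The /input and /iterationN paths and the fixed iteration count are just placeholders; substitute your own job setup and stopping condition:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int numIterations = 5;              // placeholder stopping condition
    Path input = new Path("/input");    // placeholder initial input

    for (int i = 0; i < numIterations; i++) {
      Path output = new Path("/iteration" + i);

      Job job = new Job(conf, "iteration " + i);
      // job.setJarByClass(...); job.setMapperClass(...); job.setReducerClass(...);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        System.exit(1);                 // stop the chain if a pass fails
      }

      // The next pass reads the files the previous pass just wrote, so an
      // interrupted workflow can be restarted from the last checkpoint.
      input = output;
    }
  }
}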
