Thanks for your replies. Yeah, I had to restructure part of my code, but it is all good now. Thanks again for your suggestions.
Renato M.

2010/5/11 Jay Booth <[email protected]>

> Probably the most direct route to get your desired result is to save
> the objects to either a SequenceFile or plain text file on DFS. Then
> in the configure() section of your MapReduce jobs, you open the file
> on DFS, stream the contents into a local variable, and refer to it as
> you need to. Either way, you'll need some sort of serialization via
> Writable or plain text.
>
> On Tue, May 11, 2010 at 4:19 PM, Renato Marroquín Mogrovejo
> <[email protected]> wrote:
> > Hi Aaron,
> >
> > The thing is that I have a data structure that is saved into a
> > vector, and this vector needs to be available to my MapReduce jobs
> > while iterating. So do you think serializing these objects would be
> > a good and easy way to do it? It's a vector in which each node
> > contains another user-defined data structure. Maybe I will try to do
> > it first just using files, and see how the throughput goes.
> > Hey, do you know where I can find some examples of serializing
> > objects for Hadoop to save them into SequenceFiles?
> > Thanks in advance.
> >
> > Renato M.
> >
> > 2010/5/11 Aaron Kimball <[email protected]>
> >
> >> Perhaps this is guidance in the area you were hoping for: if your
> >> data is in objects that implement the interface 'Writable', then
> >> you can use the SequenceFileOutputFormat and SequenceFileInputFormat
> >> to store your intermediate data in binary form in disk-backed files
> >> called SequenceFiles. The serialization will happen through the
> >> write() and readFields() methods of your objects, which will
> >> automatically be called by the OutputFormat/InputFormat as they move
> >> through the system. So your subsequent MR pass will receive objects
> >> back in the same form as they were emitted. This is a considerably
> >> better idea (from both a throughput and a sanity perspective) in a
> >> chained MapReduce job.
> >>
> >> - Aaron
> >>
> >> On Tue, May 11, 2010 at 10:31 AM, Aaron Kimball <[email protected]>
> >> wrote:
> >>
> >> > What objects are you referring to? I'm not sure I understand your
> >> > question.
> >> > - Aaron
> >> >
> >> > On Tue, May 11, 2010 at 6:38 AM, Renato Marroquín Mogrovejo
> >> > <[email protected]> wrote:
> >> >
> >> >> Thanks Aaron! I was thinking the same after doing some reading.
> >> >> Man, what about serializing the objects? Do you think that is a
> >> >> good idea?
> >> >> Thanks again.
> >> >>
> >> >> Renato M.
> >> >>
> >> >> 2010/5/5 Aaron Kimball <[email protected]>
> >> >>
> >> >> > Renato,
> >> >> >
> >> >> > In general, if you need to perform a multi-pass MapReduce
> >> >> > workflow, each pass materializes its output to files. The
> >> >> > subsequent pass then reads those same files back in as input.
> >> >> > This allows the workflow to start at the last "checkpoint" if
> >> >> > it gets interrupted. There is no persistent in-memory
> >> >> > distributed storage feature in Hadoop that would allow a
> >> >> > MapReduce job to post results to memory for consumption by a
> >> >> > subsequent job.
> >> >> >
> >> >> > So you would just read your initial data from /input, and write
> >> >> > your interim results to /iteration0. Then the next pass reads
> >> >> > from /iteration0 and writes to /iteration1, etc.
> >> >> >
> >> >> > If your data is reasonably small and you think it could fit in
> >> >> > memory somewhere, then you could experiment with using other
> >> >> > distributed key-value stores (memcached, HBase, Cassandra,
> >> >> > etc.) to hold intermediate results. But this will require some
> >> >> > integration work on your part.
> >> >> > - Aaron
> >> >> >
> >> >> > On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo
> >> >> > <[email protected]> wrote:
> >> >> >
> >> >> > > Hi everyone, I have recently started to play around with
> >> >> > > Hadoop, but I am running into some "design" problems.
> >> >> > > I need to make a loop to execute the same job several times,
> >> >> > > and in each iteration get the processed values (not using a
> >> >> > > file, because I would then need to read it back). I was using
> >> >> > > a static vector in my main class (the one that iterates and
> >> >> > > executes the job in each iteration) to retrieve those values,
> >> >> > > and it did work while I was running in standalone mode. Now I
> >> >> > > have tried to test it in pseudo-distributed mode and
> >> >> > > obviously it is not working.
> >> >> > > Any suggestions, please?
> >> >> > >
> >> >> > > Thanks in advance,
> >> >> > >
> >> >> > > Renato M.
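P.S. A minimal sketch of the Writable/SequenceFile approach Aaron
describes above, for anyone who lands here looking for the example
Renato asked about. NodeData and NodeVector are hypothetical names
standing in for the vector of user-defined structures discussed in the
thread; only the Writable contract (a public no-arg constructor plus
write()/readFields()) comes from the Hadoop API, and in real code each
public class would live in its own file.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Writable;

    // Hypothetical stand-in for the user-defined structure held in
    // each node of the vector.
    public class NodeData implements Writable {
      private int id;
      private double score;

      public NodeData() {}  // Writables need a public no-arg constructor

      public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeDouble(score);
      }

      public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        score = in.readDouble();
      }
    }

    // The vector itself is also a Writable: a length prefix followed
    // by the serialized elements, restored in the same order.
    public class NodeVector implements Writable {
      private List<NodeData> nodes = new ArrayList<NodeData>();

      public void write(DataOutput out) throws IOException {
        out.writeInt(nodes.size());
        for (NodeData n : nodes) {
          n.write(out);
        }
      }

      public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        nodes.clear();
        for (int i = 0; i < size; i++) {
          NodeData n = new NodeData();
          n.readFields(in);
          nodes.add(n);
        }
      }
    }

Writing such a vector out to DFS would then look something like this
(the path and the IntWritable key are illustrative, not required by
the API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/iteration0/vector.seq");
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, NodeVector.class);
    writer.append(new IntWritable(0), vector);  // 'vector' is a NodeVector
    writer.close();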

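A sketch of the chained driver Aaron outlines, using the old
org.apache.hadoop.mapred API that was current at the time. The
iteration count, the path layout, and MyMapper/MyReducer are all
illustrative placeholders, not part of any Hadoop API:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        int iterations = 5;        // illustrative; could also test convergence
        String input = "/input";   // path layout follows Aaron's example
        for (int i = 0; i < iterations; i++) {
          JobConf job = new JobConf(IterativeDriver.class);
          job.setJobName("iteration-" + i);
          job.setMapperClass(MyMapper.class);    // hypothetical mapper
          job.setReducerClass(MyReducer.class);  // hypothetical reducer
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.setInputPaths(job, new Path(input));
          String output = "/iteration" + i;
          FileOutputFormat.setOutputPath(job, new Path(output));
          JobClient.runJob(job);  // blocks until this pass completes
          input = output;         // the next pass reads this pass's output
        }
      }
    }

Jay's configure() pattern plugs into the same job: each map task
re-opens the shared file on DFS and streams it into memory before any
map() call runs. Again, the path and the record types are assumptions
made for illustration:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private NodeVector vector = new NodeVector();

      // Runs once per task, before any map() call: load the shared
      // vector from the SequenceFile written by the previous pass.
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          Path path = new Path("/iteration0/vector.seq");  // illustrative
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, job);
          reader.next(new IntWritable(), vector);  // stream into memory
          reader.close();
        } catch (IOException e) {
          throw new RuntimeException("could not load side data", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // ... consult 'vector' while processing each record ...
      }
    }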