Sounds like HBase could work for you (random access to big data). It comes with Input/OutputFormats, so you can hook it up to your MR job as a source or sink. Come hang out on the hbase-user mailing list if you have more questions.
St.Ack

On Fri, Oct 16, 2009 at 3:20 PM, Sujit Pal <[email protected]> wrote:
> Sorry, HDFS should have been HBase.
>
> -sujit
>
> On Fri, 2009-10-16 at 14:36 -0700, Sujit Pal wrote:
> > Hi,
> >
> > I have a situation where I need to "collect" data into some sort of
> > common medium from a set of mapreduce jobs, then have another mapreduce
> > job "consolidate" these to provide the final result. I was considering
> > using some sort of database to store the output of the first stage and
> > then read them (I need to be able to do random access on the keys) in
> > the second stage.
> >
> > I thought of using HDFS and a colleague suggested Apache Cassandra. Both
> > seem to be implementations of BigTable. I read that HDFS is a file
> > handle hog, but no such thing on the Cassandra site. Would it be
> > preferable, in your opinion, to use one over the other? I suppose I
> > should just try them both, but if someone has done this already, would
> > appreciate their input before doing this.
> >
> > Thanks
> > Sujit
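For readers following along: the Input/OutputFormat hookup mentioned in the reply can be sketched roughly as below. This is a minimal, untested sketch using HBase's `org.apache.hadoop.hbase.mapreduce` helpers (`TableMapReduceUtil`, `TableMapper`, `TableReducer`); the table names `stage1` and `consolidated`, the column family `f`, and the count-based merge are all made up for illustration, and it needs a running HBase cluster plus the HBase jars on the classpath to actually run.

```java
// Sketch: an MR job that reads its input from one HBase table and writes
// its output to another, via HBase's TableInputFormat/TableOutputFormat
// (wired up by TableMapReduceUtil). Table and column names are invented.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ConsolidateJob {

  // Reads rows from the (hypothetical) "stage1" table and re-emits them
  // keyed by row, so the reducer sees all versions of a key together.
  static class Stage1Mapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(row, value);
    }
  }

  // Consolidates the values for each key and writes a Put back to HBase
  // through TableOutputFormat. The merge logic here (a simple count) is
  // a placeholder; a real job would combine the Results meaningfully.
  static class ConsolidateReducer
      extends TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Result> values,
        Context ctx) throws IOException, InterruptedException {
      long count = 0;
      for (Result ignored : values) {
        count++;
      }
      Put put = new Put(key.get());
      put.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(count));
      ctx.write(key, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "consolidate");
    job.setJarByClass(ConsolidateJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch larger batches per RPC for scans
    scan.setCacheBlocks(false);  // MR scans shouldn't pollute the block cache

    // "stage1" as source (TableInputFormat under the hood)...
    TableMapReduceUtil.initTableMapperJob(
        "stage1", scan, Stage1Mapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    // ...and "consolidated" as sink (TableOutputFormat under the hood).
    TableMapReduceUtil.initTableReducerJob(
        "consolidated", ConsolidateReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The first-stage jobs would write into `stage1` the same way (an identity mapper with `initTableReducerJob`), which gives the random access on keys the poster asked about: any job or client can `Get` a row from `stage1` directly while the consolidation job scans it in bulk.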
