Sounds like HBase could work for you (random access to big data).  It ships
with Input/OutputFormats, so you can hook it into your MR job as a source or
a sink.  Come hang out on the hbase-user mailing list if you have more
questions.
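
For concreteness, here is a minimal sketch of that wiring using
TableMapReduceUtil, which configures TableInputFormat and TableOutputFormat
for you. The table names ("stage1", "final"), the column family "f", and the
qualifiers are hypothetical placeholders, and client method names drift a bit
between HBase versions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ConsolidateJob {

  // Source side: TableInputFormat hands each mapper (rowkey, Result) pairs.
  static class Stage2Mapper extends TableMapper<Text, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // Assumes the first-stage job stored ints via Bytes.toBytes(int).
      byte[] v = value.getValue(Bytes.toBytes("f"), Bytes.toBytes("count"));
      if (v != null) {
        String key = Bytes.toString(row.get(), row.getOffset(), row.getLength());
        ctx.write(new Text(key), new IntWritable(Bytes.toInt(v)));
      }
    }
  }

  // Sink side: TableOutputFormat writes the Puts emitted here to the
  // consolidated table.
  static class Stage2Reducer
      extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("total"), Bytes.toBytes(sum));
      ctx.write(null, put);  // the key is ignored by TableOutputFormat
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "consolidate");
    job.setJarByClass(ConsolidateJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // batch rows per RPC for scan throughput
    scan.setCacheBlocks(false);  // don't churn the block cache from a full scan

    TableMapReduceUtil.initTableMapperJob(
        "stage1", scan, Stage2Mapper.class, Text.class, IntWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("final", Stage2Reducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TableInputFormat creates one split per region of the input table, so map
tasks can run local to the regionservers holding the data.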

St.Ack

On Fri, Oct 16, 2009 at 3:20 PM, Sujit Pal <[email protected]> wrote:

> Sorry, HDFS should have been HBase.
>
> -sujit
>
> On Fri, 2009-10-16 at 14:36 -0700, Sujit Pal wrote:
> > Hi,
> >
> > I have a situation where I need to "collect" data into some sort of
> > common medium from a set of MapReduce jobs, then have another MapReduce
> > job "consolidate" it to produce the final result. I was considering
> > using some sort of database to store the output of the first stage and
> > then read it back in the second stage (I need random access by key;
> > see the lookup sketch after the thread).
> >
> > I thought of using HDFS, and a colleague suggested Apache Cassandra.
> > Both seem to be modeled on BigTable. I read that HDFS is a file handle
> > hog, but saw no such complaint on the Cassandra site. In your opinion,
> > would one be preferable over the other? I suppose I could just try them
> > both, but if someone has already done this, I would appreciate their
> > input first.
> >
> > Thanks
> > Sujit
> >
>
>
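
For the random-access-by-key requirement in the quoted question: outside of
MapReduce, reads are plain point lookups through the client API. A minimal
sketch, again using the hypothetical "stage1" table and "f"/"count" names,
with the same caveat that client class and method names vary across HBase
versions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomLookup {
  public static void main(String[] args) throws Exception {
    // Open the (hypothetical) stage-one table and fetch a single row by key.
    HTable table = new HTable(HBaseConfiguration.create(), "stage1");
    try {
      Get get = new Get(Bytes.toBytes(args[0]));  // row key from command line
      get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"));
      Result result = table.get(get);
      byte[] v = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("count"));
      System.out.println(v == null ? "no such row" : Bytes.toInt(v));
    } finally {
      table.close();
    }
  }
}

The same Get call works from inside a second-stage mapper or reducer if you
need side lookups against the first stage's table rather than a full scan.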
