Hi, I have a situation where I need to "collect" data from a set of MapReduce jobs into some sort of common store, and then have another MapReduce job "consolidate" it to produce the final result. I was considering using some kind of database to hold the output of the first stage and then read it back (I need to be able to do random access on the keys) in the second stage.
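
For concreteness, if I went the HDFS route, here is a rough sketch of what I imagine the second stage looking like, assuming the first stage writes its output as a Hadoop MapFile (which supports keyed lookups). The class name, the Text key/value types, and the "stage1.output.dir" property are just placeholders I made up for this sketch, not anything I have actually built:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Second-stage mapper that does random lookups against the first stage's
// output, assuming that output was written as a MapFile of Text -> Text.
public class ConsolidateMapper extends Mapper<Text, Text, Text, Text> {

  private MapFile.Reader reader; // keyed random access into stage-1 output

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // "stage1.output.dir" is a made-up property pointing at the MapFile
    // directory (data + index files) produced by the first stage.
    Path stage1Dir = new Path(conf.get("stage1.output.dir"));
    reader = new MapFile.Reader(fs, stage1Dir.toString(), conf);
  }

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    Text collected = new Text();
    // Random access on the key; get() returns null if the key is absent.
    if (reader.get(key, collected) != null) {
      // Consolidate the stage-1 value with this record's value.
      context.write(key, new Text(collected + "\t" + value));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (reader != null) {
      reader.close();
    }
  }
}

With Cassandra I assume the lookup in setup()/map() would instead go through its client API, but the overall shape of the job would be the same.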
I thought of using HDFS, and a colleague suggested Apache Cassandra, whose data model seems to be based on BigTable. I have read that HDFS can be a file-handle hog, but I did not see any similar caveat on the Cassandra site. Would it be preferable, in your opinion, to use one over the other? I suppose I should just try them both, but if someone has done this already, I would appreciate their input before I dive in.

Thanks,
Sujit
