Have you looked at hadoop.io.MapWritable? --- Jim Kellerman, Senior Engineer; Powerset
> -----Original Message----- > From: C G [mailto:[EMAIL PROTECTED] > Sent: Wednesday, December 19, 2007 11:59 AM > To: hadoop-user@lucene.apache.org > Subject: HashMap which can spill to disk for Hadoop? > > Hi All: > > The aggregation classes in Hadoop use a HashMap to hold > unique values in memory when computing unique counts, etc. I > ran into a situation on 32-node grid (4G memory/node) where a > single node runs out of memory within the reduce phase trying > to manage a very large HashMap. This was disappointing > because the dataset is only 44M rows (4G) of data. This is a > scenario where I am counting unique values associated with > various events, where the total number of events is very > small and the number of unique values is very high. Since > the event IDs serve as keys as the number of distinct event > IDs is small, there is a consequently small number of > reducers running, where each reducer is expected to manage a > very large HashMap of unique values. > > It looks like I need to build my own unique aggregator, so > I am looking for an implementation of HashMap which can spill > to disk as needed. I've considered using BDB as a backing > store, and I've looking into Derby's BackingStoreHashtable as well. > > For the present time I can restructure my data in an > attempt to get more reducers to run, but I can see in the > very near future where even that will run out of memory. > > Any thoughts,comments, or flames? > > Thanks, > C G > > > > --------------------------------- > Looking for last minute shopping deals? Find them fast with > Yahoo! Search. >