Sounds much better to me.
On 12/26/07 7:53 AM, "Eric Baldeschwieler" <[EMAIL PROTECTED]> wrote:

> With a secondary sort on the values during the shuffle, nothing would
> need to be kept in memory, since it could all be counted in a single
> scan. Right? Wouldn't that be a much more efficient solution?
>
> On Dec 20, 2007, at 6:30 AM, C G wrote:
>
>> If we decide to implement our own file-backed HashMap I'd be
>> willing to contribute it back to the project.
>>
>> We are rolling up unique counts per event ID. So we use event ID
>> as a key, and want to count the number of unique event values.
>> Since the number of event IDs is reasonably small (well under 100)
>> and the universe of values is large (potentially millions), we wind
>> up in a situation where we are pushing too much into memory.
>>
>> Ted commented about special-purpose HashMaps. Unfortunately, our
>> event values can be up to 128 bits long, so I don't think a
>> special-purpose HashMap would work.
>>
>> Thanks,
>> C G
>>
>> Runping Qi <[EMAIL PROTECTED]> wrote:
>>
>> It would be nice if you could contribute a file-backed HashMap, or a
>> file-backed implementation of the unique-count aggregator.
>>
>> Short of that, if you just need to count the unique values for each
>> event ID, you can do so by using the aggregate classes with each
>> event-id/event-value pair as a key and simply counting the number of
>> occurrences of each composite key.
>>
>> Runping
>>
>>
>>> -----Original Message-----
>>> From: C G [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, December 19, 2007 11:59 AM
>>> To: hadoop-user@lucene.apache.org
>>> Subject: HashMap which can spill to disk for Hadoop?
>>>
>>> Hi All:
>>>
>>> The aggregation classes in Hadoop use a HashMap to hold unique values
>>> in memory when computing unique counts, etc. I ran into a situation on
>>> a 32-node grid (4G memory/node) where a single node runs out of memory
>>> within the reduce phase trying to manage a very large HashMap. This was
>>> disappointing because the dataset is only 44M rows (4G) of data. This
>>> is a scenario where I am counting unique values associated with various
>>> events, where the total number of events is very small and the number
>>> of unique values is very high. Since the event IDs serve as keys and
>>> the number of distinct event IDs is small, there is consequently a
>>> small number of reducers running, where each reducer is expected to
>>> manage a very large HashMap of unique values.
>>>
>>> It looks like I need to build my own unique aggregator, so I am looking
>>> for an implementation of HashMap which can spill to disk as needed.
>>> I've considered using BDB as a backing store, and I've looked into
>>> Derby's BackingStoreHashtable as well.
>>>
>>> For the present time I can restructure my data in an attempt to get
>>> more reducers to run, but I can see in the very near future where even
>>> that will run out of memory.
>>>
>>> Any thoughts, comments, or flames?
>>>
>>> Thanks,
>>> C G
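
For what it's worth, here is a minimal sketch of the secondary-sort approach Eric
describes, written against the old org.apache.hadoop.mapred API (with the generics
that appeared in slightly later 0.x releases than this thread). The class names,
the tab-separated "eventId<TAB>eventValue" input format, and the wiring are all
illustrative assumptions, not anything from the thread: the mapper moves the event
value into a composite key, a custom partitioner and grouping comparator keep all
records for one event ID in one reduce() call, and the reducer counts distinct
values in a single pass over the sorted stream, holding only the previous value
and a counter in memory.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UniqueValueCount {

  // Map: emit a composite key "eventId\teventValue" so the shuffle sorts the
  // values within each event ID; the value is carried along for the reducer.
  // Assumes tab-separated input lines and tab-free event IDs (illustrative).
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final Text composite = new Text();
    private final Text eventValue = new Text();
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\t", 2);
      composite.set(fields[0] + "\t" + fields[1]);
      eventValue.set(fields[1]);
      out.collect(composite, eventValue);
    }
  }

  // Partition on the event ID only, so every record for one ID lands on the
  // same reducer regardless of its value.
  public static class EventIdPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String eventId = key.toString().split("\t", 2)[0];
      return (eventId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group reduce calls on the event ID only, so one reduce() sees every value
  // for that ID, already in sorted order thanks to the composite key.
  public static class EventIdGroupingComparator extends WritableComparator {
    public EventIdGroupingComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String idA = a.toString().split("\t", 2)[0];
      String idB = b.toString().split("\t", 2)[0];
      return idA.compareTo(idB);
    }
  }

  // Reduce: the values arrive sorted, so distinct values are counted in a
  // single scan by comparing each value with the previous one.
  public static class DistinctCountReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, LongWritable> {
    public void reduce(Text compositeKey, Iterator<Text> values,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      String eventId = compositeKey.toString().split("\t", 2)[0];
      String prev = null;
      long distinct = 0;
      while (values.hasNext()) {
        String current = values.next().toString();
        if (!current.equals(prev)) {
          distinct++;
          prev = current;
        }
      }
      out.collect(new Text(eventId), new LongWritable(distinct));
    }
  }

  // Wiring (old mapred API):
  //   JobConf conf = new JobConf(UniqueValueCount.class);
  //   conf.setMapperClass(PairMapper.class);
  //   conf.setReducerClass(DistinctCountReducer.class);
  //   conf.setPartitionerClass(EventIdPartitioner.class);
  //   conf.setOutputValueGroupingComparator(EventIdGroupingComparator.class);
  //   conf.setMapOutputKeyClass(Text.class);
  //   conf.setMapOutputValueClass(Text.class);
  //   conf.setOutputKeyClass(Text.class);
  //   conf.setOutputValueClass(LongWritable.class);
}

Runping's composite-key suggestion is the same idea without the custom partitioner
and comparator: treat each distinct (event-id, event-value) pair as its own key in
one job, then sum one count per surviving key in a second pass.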