Sounds much better to me.

On 12/26/07 7:53 AM, "Eric Baldeschwieler" <[EMAIL PROTECTED]> wrote:

> 
> With a secondary sort on the values during the shuffle, nothing would
> need to be kept in memory, since it could all be counted in a single
> scan.  Right? Wouldn't that be a much more efficient solution?
> 
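A minimal plain-Java sketch of the single-scan counting described above. It assumes the job has already been wired for a secondary sort (a hypothetical composite key of event ID plus event value, a partitioner and grouping comparator on the event ID alone, and a sort comparator on the full key), so each reduce call sees its values in sorted order; the class and method names are made up for illustration.

    import java.util.Iterator;

    // Illustrative only: the single scan a reducer could do once the
    // shuffle delivers each event ID's values in sorted order.
    public class SortedUniqueCount {

        // Counts distinct values in an already-sorted stream with O(1)
        // memory: a value is "new" exactly when it differs from the
        // previous one.
        public static long countDistinctSorted(Iterator<String> sortedValues) {
            long distinct = 0;
            String previous = null;
            while (sortedValues.hasNext()) {
                String current = sortedValues.next();
                if (previous == null || !current.equals(previous)) {
                    distinct++;
                }
                previous = current;
            }
            return distinct;
        }
    }
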
> On Dec 20, 2007, at 6:30 AM, C G wrote:
> 
>> If we decide to implement our own file-backed Hashmap, I'd be
>> willing to contribute it back to the project.
>> 
>>   We are rolling up unique counts per event ID.  So we use event ID
>> as a key, and want to count the number of unique event values.
>> Since the number of event IDs is reasonably small (well under 100)
>> and the universe of values is large (potentially millions), we wind
>> up in a situation where we are pushing too much into memory.
>> 
>>   Ted commented about special-purpose Hashmaps.  Unfortunately, our
>> event values can be up to 128 bits long, so I don't think a
>> special-purpose Hashmap would work.
>> 
>>   Thanks,
>>   C G
>> 
>> Runping Qi <[EMAIL PROTECTED]> wrote:
>> 
>> It would be nice if you could contribute a file-backed hashmap, or a
>> file-backed implementation of the unique count aggregator.
>> 
>> Short of that, if you just need to count the unique values for each
>> event id, you can do so by using the aggregate classes with each
>> event-id/event-value pair as a key and simply counting the number of
>> occurrences of each composite key.
>> 
>> Runping
>> 
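A plain-Java sketch of the composite-key logic Runping describes, not the actual aggregate classes; the class, method, and separator names are made up for illustration. Step one counts occurrences of each event-id/event-value pair, and step two counts, per event ID, how many distinct pairs appeared. In MapReduce each step is spread across reducers keyed on the composite key, so no single reducer has to hold all values for one event ID.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: unique-value counts per event ID via composite keys.
    public class CompositeKeyUniqueCount {

        // Separator assumed not to occur in event IDs or values.
        private static final char SEP = '\u0001';

        // Each record is a String[] of {eventId, eventValue}.
        public static Map<String, Long> distinctValuesPerEvent(Iterable<String[]> records) {
            // Step 1: count occurrences of each eventId/eventValue composite key.
            Map<String, Long> compositeCounts = new HashMap<String, Long>();
            for (String[] rec : records) {
                String composite = rec[0] + SEP + rec[1];
                Long c = compositeCounts.get(composite);
                compositeCounts.put(composite, c == null ? 1L : c + 1L);
            }
            // Step 2: each distinct composite key adds one distinct value
            // to its event ID.
            Map<String, Long> distinctPerEvent = new HashMap<String, Long>();
            for (String composite : compositeCounts.keySet()) {
                String eventId = composite.substring(0, composite.indexOf(SEP));
                Long c = distinctPerEvent.get(eventId);
                distinctPerEvent.put(eventId, c == null ? 1L : c + 1L);
            }
            return distinctPerEvent;
        }
    }
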
>> 
>>> -----Original Message-----
>>> From: C G [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, December 19, 2007 11:59 AM
>>> To: hadoop-user@lucene.apache.org
>>> Subject: HashMap which can spill to disk for Hadoop?
>>> 
>>> Hi All:
>>> 
>>> The aggregation classes in Hadoop use a HashMap to hold unique values
>>> in memory when computing unique counts, etc. I ran into a situation on
>>> a 32-node grid (4G memory/node) where a single node runs out of memory
>>> within the reduce phase trying to manage a very large HashMap. This was
>>> disappointing because the dataset is only 44M rows (4G) of data. This
>>> is a scenario where I am counting unique values associated with various
>>> events, where the total number of events is very small and the number
>>> of unique values is very high. Since the event IDs serve as keys and
>>> the number of distinct event IDs is small, there is consequently a
>>> small number of reducers running, and each reducer is expected to
>>> manage a very large HashMap of unique values.
>>> 
>>> It looks like I need to build my own unique aggregator, so I am
>>> looking for an implementation of HashMap which can spill to disk as
>>> needed. I've considered using BDB as a backing store, and I've looked
>>> into Derby's BackingStoreHashtable as well.
>>> 
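One possible shape for such a structure, sketched with only the standard library rather than BDB or Derby (all names made up for illustration): buffer distinct values in a sorted in-memory set, write the set out as a sorted run whenever it exceeds a threshold, and compute the final distinct count with a merge over the runs, so memory stays bounded by the threshold.

    import java.io.*;
    import java.util.*;

    // Illustrative only: a distinct-value counter that spills sorted runs
    // to temp files and merges them at the end.
    public class SpillingUniqueCounter {

        private final int spillThreshold;
        private final TreeSet<String> buffer = new TreeSet<String>();
        private final List<File> spills = new ArrayList<File>();

        public SpillingUniqueCounter(int spillThreshold) {
            this.spillThreshold = spillThreshold;
        }

        public void add(String value) throws IOException {
            buffer.add(value);
            if (buffer.size() >= spillThreshold) {
                spill();
            }
        }

        // Write the in-memory run to disk in sorted order and clear it.
        private void spill() throws IOException {
            File run = File.createTempFile("uniq-run-", ".txt");
            run.deleteOnExit();
            BufferedWriter out = new BufferedWriter(new FileWriter(run));
            for (String v : buffer) {
                out.write(v);
                out.newLine();
            }
            out.close();
            buffer.clear();
            spills.add(run);
        }

        // Merge the sorted runs; duplicates across runs come out adjacent,
        // so one pass with a "previous" check counts distinct values.
        public long countDistinct() throws IOException {
            if (!buffer.isEmpty()) {
                spill();
            }
            PriorityQueue<String[]> heap = new PriorityQueue<String[]>(
                    Math.max(1, spills.size()),
                    new Comparator<String[]>() {
                        public int compare(String[] a, String[] b) {
                            return a[0].compareTo(b[0]);
                        }
                    });
            List<BufferedReader> readers = new ArrayList<BufferedReader>();
            for (int i = 0; i < spills.size(); i++) {
                BufferedReader r = new BufferedReader(new FileReader(spills.get(i)));
                readers.add(r);
                String line = r.readLine();
                if (line != null) {
                    heap.add(new String[] { line, Integer.toString(i) });
                }
            }
            long distinct = 0;
            String previous = null;
            while (!heap.isEmpty()) {
                String[] head = heap.poll();
                if (!head[0].equals(previous)) {
                    distinct++;
                    previous = head[0];
                }
                String next = readers.get(Integer.parseInt(head[1])).readLine();
                if (next != null) {
                    heap.add(new String[] { next, head[1] });
                }
            }
            for (BufferedReader r : readers) {
                r.close();
            }
            return distinct;
        }
    }
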
>>> For the present time I can restructure my data in an attempt to get
>>> more reducers to run, but I can see that in the very near future even
>>> that will run out of memory.
>>> 
>>> Any thoughts, comments, or flames?
>>> 
>>> Thanks,
>>> C G
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
