Hi All:
   
  The aggregation classes in Hadoop use a HashMap to hold unique values in 
memory when computing unique counts, etc.  I ran into a situation on a 32-node 
grid (4G memory/node) where a single node runs out of memory within the reduce 
phase trying to manage a very large HashMap.  This was disappointing because 
the dataset is only 44M rows (4G) of data.  This is a scenario where I am 
counting unique values associated with various events, where the total number 
of events is very small and the number of unique values is very high.  Since 
the event IDs serve as keys and the number of distinct event IDs is small, only 
a small number of reducers run, and each reducer is expected to manage a very 
large HashMap of unique values.
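
  To make the scenario concrete, this is roughly the pattern the reduce side 
ends up in -- my own simplified sketch against the old mapred API, not the 
actual aggregation source: one reduce call per event ID, with every distinct 
value pulled into an in-memory set.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One reduce call per event ID; every distinct value for that event
// has to sit in the HashSet at once, which is what exhausts the heap.
public class UniqueCountReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {

  public void reduce(Text eventId, Iterator<Text> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    Set<String> unique = new HashSet<String>();
    while (values.hasNext()) {
      unique.add(values.next().toString());
    }
    output.collect(eventId, new LongWritable(unique.size()));
  }
}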
   
  It looks like I need to build my own unique aggregator, so I am looking for 
an implementation of HashMap which can spill to disk as needed.  I've 
considered using BDB as a backing store, and I've been looking into Derby's 
BackingStoreHashtable as well.  
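
  In case it helps the discussion, here is a rough, untested sketch of what I 
have in mind for the BDB route (written against the BDB JE API as I understand 
it; the class name is mine): a disk-backed set where putNoOverwrite does the 
duplicate detection, so the heap only holds JE's cache rather than every value.

import java.io.File;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.OperationStatus;

// Disk-backed "set" of values: duplicates are rejected by putNoOverwrite,
// so unique values spill to disk instead of accumulating on the heap.
public class DiskBackedUniqueCounter {

  private final Environment env;
  private final Database db;
  private long uniqueCount = 0;

  public DiskBackedUniqueCounter(File spillDir) throws DatabaseException {
    EnvironmentConfig envConf = new EnvironmentConfig();
    envConf.setAllowCreate(true);
    env = new Environment(spillDir, envConf);

    DatabaseConfig dbConf = new DatabaseConfig();
    dbConf.setAllowCreate(true);
    db = env.openDatabase(null, "unique-values", dbConf);
  }

  // Returns true only the first time a value is seen.
  public boolean add(String value) throws DatabaseException {
    DatabaseEntry key = new DatabaseEntry(value.getBytes());
    DatabaseEntry data = new DatabaseEntry(new byte[] { 1 });
    if (db.putNoOverwrite(null, key, data) == OperationStatus.SUCCESS) {
      uniqueCount++;
      return true;
    }
    return false;   // KEYEXIST: already counted
  }

  public long getUniqueCount() { return uniqueCount; }

  public void close() throws DatabaseException {
    db.close();
    env.close();
  }
}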
   
  For the time being I can restructure my data in an attempt to get more 
reducers to run, but I can see that in the very near future even that will run 
out of memory.
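
  For what it's worth, the restructuring I have in mind is the usual 
composite-key trick (again just a sketch, the names are mine): make the map 
output key "eventID <tab> value" so the shuffle spreads the pairs over many 
reducers and the de-duplication falls out of the sort; a second, trivial job 
then strips the value off the key and sums 1 per event ID.  The first-pass 
reducer is just:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// First pass: the map output key is "eventID\tvalue", so every distinct
// (event, value) pair becomes its own reduce key and no reducer ever
// holds a collection of values in memory.
public class DistinctPairReducer extends MapReduceBase
    implements Reducer<Text, NullWritable, Text, NullWritable> {

  public void reduce(Text eventAndValue, Iterator<NullWritable> ignored,
                     OutputCollector<Text, NullWritable> output,
                     Reporter reporter) throws IOException {
    // reduce() fires once per distinct composite key, so emitting the key
    // exactly once is the entire de-duplication step.
    output.collect(eventAndValue, NullWritable.get());
  }
}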
   
  Any thoughts, comments, or flames?
   
  Thanks,
  C G
   

       