Yes, there is a lot of overhead involved in doing it this way: you have one secondary HashMap per outer key, and each entry in every map also needs its own HashMap.Entry object. Object overhead on a 64-bit machine is at least 12 bytes per entry (an 8-byte reference plus 4 bytes of object overhead), before counting the entry's own fields.

I can recommend a few possible strategies:

- For the secondary HashMaps, don't use the default initial capacity (16 for java.util.HashMap; 11 is the Hashtable default). Set it to something like 1 or 2 to start. You will spend more time growing them, but waste less memory on maps that never hold more than a few entries.
- It *might* be that a single map keyed by a pair of Strings is more memory efficient here. It depends on how the keys are distributed.
- Does the secondary map need to be a map? If it has just a handful of keys, a linear search isn't expensive, and you could just map to two parallel arrays of String and float.
- Do you absolutely need String keys, or would Long or Integer do for your purposes?

Finally, from struggling with a very similar issue in my own code, I ended up writing a class called org.apache.mahout.cf.taste.impl.common.FastMap, which uses linear probing instead of separate chaining to implement a hashtable. So you don't have an extra object per entry, though you need more hash buckets to get reasonable performance. For Maps that are not updated constantly, I find it's more memory efficient for sure, and a little faster.

As a very last resort, look for custom Map implementations out there (or write your own) that map to a primitive float value rather than an Object (Float). Saving even that Object overhead would be great. Or you could quite easily modify FastMap to use floats. It's icky to copy-and-paste code, but the savings at scale may be so compelling that it's the right thing to do.

On Sun, Jul 13, 2008 at 9:31 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
> In my classification code, I create the model easily using Map, Reduce. But
> it has become difficult to do classification with big datasets. For a big
> dataset like Wikipedia it has become difficult to load the data into
> memory (even though it takes only 600MB on disk). It shoots past 2.5GB
> when I use a HashMap<String, HashMap<String, Float>> to store the weights. I
> wish there was a big matrix server out there, so that all I had to do to
> fetch a value was call fetch(row, column).
>
> I am trying to put the data on HBase.
>
> Please tell me if there are simpler solutions to do this using Hadoop, or
> any other package.
>
> Robin
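
To make the parallel-arrays suggestion above a bit more concrete, here is a rough sketch of what a tiny secondary "map" could look like. The class name SmallFloatMap and its methods are made up for illustration (they are not part of Mahout or the JDK), and it assumes each row holds only a handful of keys, so a linear scan stays cheap.

```java
// Hypothetical helper for illustration only; not part of Mahout or the JDK.
// Stores String -> float pairs in two parallel arrays, so there is no
// per-entry Entry object and no boxed Float.
final class SmallFloatMap {
    private String[] keys = new String[2];   // start tiny, grow on demand
    private float[] values = new float[2];
    private int size;

    /** Returns the value stored for key, or defaultValue if the key is absent. */
    float get(String key, float defaultValue) {
        for (int i = 0; i < size; i++) {      // linear search over the used slots
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return defaultValue;
    }

    /** Adds a new pair, or overwrites the value if the key already exists. */
    void put(String key, float value) {
        java.util.Objects.requireNonNull(key, "key");
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        if (size == keys.length) {            // both arrays are full: double them
            int newCapacity = size * 2;
            keys = java.util.Arrays.copyOf(keys, newCapacity);
            values = java.util.Arrays.copyOf(values, newCapacity);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    int size() {
        return size;
    }
}
```

The outer structure could then be something like a HashMap<String, SmallFloatMap>. Lookups are O(n) in the number of keys per row, so this only pays off while rows stay small.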

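
For the linear-probing idea mentioned above, a minimal sketch of an open-addressing table that maps String keys straight to primitive floats might look like the following. This is not the FastMap code; the class name ProbingFloatMap, the 0.5 load factor, and the hash-spreading step are assumptions chosen for the example, and removal is left out because it needs extra bookkeeping with linear probing.

```java
// Hypothetical sketch for illustration only; this is NOT Mahout's FastMap.
// Open addressing with linear probing: keys and values live in two arrays,
// so there is no Entry object per mapping and no boxed Float.
final class ProbingFloatMap {
    private String[] keys;
    private float[] values;
    private int size;

    ProbingFloatMap(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 28)) {
            throw new IllegalArgumentException("expectedSize out of range");
        }
        int capacity = 4;
        while (capacity < 2L * expectedSize) {   // power of two, load factor <= 0.5
            capacity <<= 1;
        }
        keys = new String[capacity];
        values = new float[capacity];
    }

    /** Returns the slot holding key, or the empty slot where it would be inserted. */
    private int slotFor(String key) {
        int mask = keys.length - 1;
        int h = key.hashCode();
        int i = (h ^ (h >>> 16)) & mask;          // mix high bits into the index
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & mask;                   // linear probing: try the next slot
        }
        return i;
    }

    float get(String key, float defaultValue) {
        int i = slotFor(key);
        return keys[i] == null ? defaultValue : values[i];
    }

    void put(String key, float value) {
        int i = slotFor(key);
        if (keys[i] == null) {                    // new key
            if ((size + 1) * 2 > keys.length) {   // would exceed load factor 0.5
                resize();
                i = slotFor(key);
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    /** Doubles the table and re-inserts every existing pair. */
    private void resize() {
        String[] oldKeys = keys;
        float[] oldValues = values;
        keys = new String[oldKeys.length * 2];
        values = new float[oldValues.length * 2];
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                int i = slotFor(oldKeys[j]);
                keys[i] = oldKeys[j];
                values[i] = oldValues[j];
            }
        }
    }

    int size() {
        return size;
    }
}
```

Keeping the table at most half full is the "more hash buckets" trade-off: empty slots cost one reference and one float each, which is still far less than an Entry object plus a boxed Float per mapping.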