On Sun, Jul 13, 2008 at 9:26 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> Yeah there is a lot of overhead involved in doing it this way: you
> have several HashMaps, but also, each entry involves a HashMap.Entry
> object. Object overhead on a 64-bit machine is at least 12 bytes. (8
> byte reference + 4 byte object overhead)
>
> I can recommend a few possible strategies:
> - Those secondary HashMaps -- don't use the default capacity of 16.
> Set it to something like 1 or 2 to start. You will spend more time
> growing them but waste less memory on maps that don't have more than
> a few entries

Did that. Same amount of memory used. That just means passing 1 as the
initial capacity in the constructor, right?

> - It *might* be possible that a single map with a key being a pair of
> Strings is more memory efficient here. It depends on how keys are
> distributed.
> - Does the secondary map need to be a map? If it has just a handful of
> keys, a linear search isn't difficult, and you could just map to two
> parallel arrays of String and float

Some features occur only in certain labels. There will be many features
that are unique to one label or to two labels (it depends on the data).
Would the size be lower if the parallel arrays were ArrayLists?

> - Do you absolutely need String keys or would Long or Integer do for
> your purposes

I am keeping a mapping of String keys to integers. I tried using
Map<Integer, Map<Integer, String>>. Memory use is no different, but it
takes a few seconds longer because of the lookup needed for every
String encountered.

> Finally from struggling with a very similar issue in my code I ended
> up writing a class called
> org.apache.mahout.cf.taste.impl.common.FastMap which uses linear
> probing instead of separate chaining to implement a hashtable. So, you
> don't have an extra object per entry, though you need more hash
> buckets to get reasonable performance. For Maps that are not updated
> constantly, I find it's more memory efficient for sure and a little
> faster.
>
> As a very last resort, look for custom Map implementations out there
> (or write your own) that map to a float (primitive) value rather than
> an Object (Float). Saving even that Object overhead itself would be
> great.
>
> Or you could modify FastMap very easily too to use floats -- it's icky
> to copy-and-paste code but the savings at scale may be so compelling
> that it's the right thing to do.

I will try FastMap<String -> FastMap<Integer -> Float>> and post the
results.

> On Sun, Jul 13, 2008 at 9:31 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
> > In my classification code, I create the model easily using
> > Map/Reduce. But it has become difficult to do classification with
> > big datasets. For a big dataset like Wikipedia it has become
> > difficult to load the data into memory (even though it takes only
> > 600MB on disk). It shoots past 2.5GB when I use a
> > HashMap<String, HashMap<String, Float>> to store the weights. I wish
> > there was a big matrix server out there, so that all I had to do to
> > fetch a value was call fetch(row, column).
> >
> > I am trying to put the data on HBase.
> >
> > Please tell me if there are simpler solutions for this using Hadoop
> > or any other package.
> >
> > Robin

Robin
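
For reference, the two ideas above (linear probing instead of separate
chaining, and primitive float values instead of Float objects) can be
combined roughly as follows. This is only a minimal sketch and not the
actual FastMap code: the class name StringFloatMap, its methods, and the
0.5 load factor are assumptions made for illustration. It does not
support null keys or removal (removal under linear probing needs
tombstones or re-insertion).

```java
/** Hypothetical sketch: String -> float hash map using linear probing. */
final class StringFloatMap {
    private String[] keys;   // null marks an empty slot
    private float[] values;  // primitive floats, so no Float object per entry
    private int size;

    StringFloatMap(int expectedSize) {
        // Power-of-two capacity, at least twice the expected size.
        int capacity = 4;
        while (capacity < expectedSize * 2) {
            capacity <<= 1;
        }
        keys = new String[capacity];
        values = new float[capacity];
    }

    /** Returns the slot holding key, or the empty slot where it would go. */
    private int slotFor(String key) {
        int mask = keys.length - 1;
        int h = key.hashCode();
        int i = (h ^ (h >>> 16)) & mask;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & mask; // linear probing: try the next slot
        }
        return i;
    }

    void put(String key, float value) {
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full so probes stay short
        }
        int i = slotFor(key);
        if (keys[i] == null) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    float get(String key, float defaultValue) {
        int i = slotFor(key);
        return keys[i] == null ? defaultValue : values[i];
    }

    int size() {
        return size;
    }

    /** Doubles the table and re-inserts every existing entry. */
    private void grow() {
        String[] oldKeys = keys;
        float[] oldValues = values;
        keys = new String[oldKeys.length * 2];
        values = new float[oldValues.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }

    public static void main(String[] args) {
        StringFloatMap weights = new StringFloatMap(2);
        weights.put("featureA", 0.25f);
        weights.put("featureB", 1.5f);
        weights.put("featureA", 0.75f); // overwrites the earlier weight
        System.out.println(weights.get("featureA", 0f));
        System.out.println(weights.get("missing", -1f));
        System.out.println(weights.size());
    }
}
```

Each entry here costs one key reference and one 4-byte float in two flat
arrays, with no per-entry Entry or Float object. The price is the extra
empty slots needed to keep probe sequences short.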
