Re: Loading Huge Models for classification

Sean Owen Sun, 13 Jul 2008 08:56:55 -0700

Yeah, there is a lot of overhead involved in doing it this way: you
have several HashMaps, but also, each entry involves a HashMap.Entry
object. Object overhead on a 64-bit machine is at least 12 bytes (8
byte reference + 4 byte object overhead).

I can recommend a few possible strategies:
- Those secondary HashMaps -- don't use the default initial capacity
of 16. Set it to something like 1 or 2 to start. You will spend more
time growing them but waste less memory on maps that don't have more
than a few entries.
- It *might* be possible that a single map with a key being a pair of
Strings is more memory efficient here. It depends on how keys are
distributed.
- Does the secondary map need to be a map? If it has just a handful of
keys, a linear search isn't difficult, and you could just map to two
parallel arrays of String and float
- Do you absolutely need String keys, or would Long or Integer do for
your purposes?
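
A minimal sketch of the parallel-array idea from the list above, assuming
most secondary maps hold only a handful of entries. The class names
SmallFloatRow and WeightTable are hypothetical, made up for illustration,
and null keys are not supported:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a small HashMap<String, Float>: parallel arrays, linear search. */
final class SmallFloatRow {
    private String[] keys = new String[2];  // start tiny; rows are assumed to stay small
    private float[] values = new float[2];
    private int size;

    /** Returns the value stored for key, or defaultValue if the key is absent. */
    float get(String key, float defaultValue) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return defaultValue;
    }

    /** Overwrites an existing key, or appends a new one, doubling the arrays when full. */
    void put(String key, float value) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    int size() {
        return size;
    }
}

/** Hypothetical two-level table: an ordinary outer HashMap whose rows are parallel arrays. */
final class WeightTable {
    private final Map<String, SmallFloatRow> rows = new HashMap<String, SmallFloatRow>();

    void put(String row, String column, float weight) {
        SmallFloatRow r = rows.get(row);
        if (r == null) {
            r = new SmallFloatRow();
            rows.put(row, r);
        }
        r.put(column, weight);
    }

    /** Returns 0.0f for a missing row or column (an assumption of this sketch). */
    float get(String row, String column) {
        SmallFloatRow r = rows.get(row);
        return r == null ? 0.0f : r.get(column, 0.0f);
    }
}
```

Lookups are linear in the row length, so this only pays off while rows
stay short; a long row would be better served by a real hash table.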

Finally, from struggling with a very similar issue in my code, I ended
up writing a class called
org.apache.mahout.cf.taste.impl.common.FastMap which uses linear
probing instead of separate chaining to implement a hashtable. So, you
don't have an extra object per entry, though you need more hash
buckets to get reasonable performance. For Maps that are not updated
constantly, I find it's more memory efficient for sure and a little
faster.
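
To illustrate the technique (this is not FastMap's actual code), here is
a minimal sketch of a hashtable using linear probing. The class name
LinearProbingMap and the 0.5 load limit are assumptions chosen for the
example; removal is left out because it needs extra bookkeeping under
open addressing:

```java
/** Illustrative open-addressing map with linear probing; no per-entry objects are allocated. */
final class LinearProbingMap<K, V> {
    private Object[] keys = new Object[16];    // capacity stays a power of two
    private Object[] values = new Object[16];
    private int size;

    private static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);                 // mix high bits into the low bits used for indexing
    }

    /** Returns the value for key, or null if absent. Null keys are not supported. */
    @SuppressWarnings("unchecked")
    V get(K key) {
        int mask = keys.length - 1;
        int i = spread(key) & mask;
        while (keys[i] != null) {              // an empty slot ends the probe sequence
            if (keys[i].equals(key)) {
                return (V) values[i];
            }
            i = (i + 1) & mask;                // step to the next slot, wrapping around
        }
        return null;
    }

    void put(K key, V value) {
        if ((size + 1) * 2 > keys.length) {    // keep the table at most half full
            resize();
        }
        if (insert(keys, values, key, value)) {
            size++;
        }
    }

    int size() {
        return size;
    }

    /** Inserts or overwrites; returns true only if the key was not already present. */
    private static boolean insert(Object[] ks, Object[] vs, Object key, Object value) {
        int mask = ks.length - 1;
        int i = spread(key) & mask;
        while (ks[i] != null) {
            if (ks[i].equals(key)) {
                vs[i] = value;
                return false;
            }
            i = (i + 1) & mask;
        }
        ks[i] = key;
        vs[i] = value;
        return true;
    }

    private void resize() {
        Object[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new Object[oldKeys.length * 2];
        values = new Object[oldValues.length * 2];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) {
                insert(keys, values, oldKeys[i], oldValues[i]);
            }
        }
    }
}
```

The half-full limit is the "more hash buckets" cost mentioned above: the
table trades empty slots for the entry objects it no longer allocates.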

As a very last resort, look for custom Map implementations out there
(or write your own) that map to a primitive float value rather than an
Object (Float). Saving even that Object overhead itself would be
great.

Or you could modify FastMap very easily too to use floats -- it's icky
to copy-and-paste code but the savings at scale may be so compelling
that it's the right thing to do.
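
As a rough sketch of what such a float-valued variant could look like
(again, not derived from FastMap's source), the values can live in a
float[] next to the key array. StringFloatMap and the caller-supplied
default for missing keys are assumptions made for this example:

```java
/** Hypothetical String -> float map: linear probing with primitive values, so no Float boxing. */
final class StringFloatMap {
    private String[] keys = new String[16];  // capacity stays a power of two
    private float[] values = new float[16];
    private int size;

    /** Returns the stored float, or defaultValue if the key is absent. Null keys are not supported. */
    float get(String key, float defaultValue) {
        int mask = keys.length - 1;
        int i = key.hashCode() & mask;
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return defaultValue;
    }

    void put(String key, float value) {
        if ((size + 1) * 2 > keys.length) {  // keep the table at most half full
            grow();
        }
        int mask = keys.length - 1;
        int i = key.hashCode() & mask;
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                values[i] = value;           // overwrite in place
                return;
            }
            i = (i + 1) & mask;
        }
        keys[i] = key;
        values[i] = value;
        size++;
    }

    int size() {
        return size;
    }

    /** Doubles both arrays and reinserts every existing entry. */
    private void grow() {
        String[] oldKeys = keys;
        float[] oldValues = values;
        keys = new String[oldKeys.length * 2];
        values = new float[oldValues.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }
}
```

Each value then costs 4 bytes in the array instead of a reference plus a
separate Float object.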

On Sun, Jul 13, 2008 at 9:31 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
> In my classification code, I create the model easily using MapReduce, but
> it has become difficult to do classification with big datasets. For a big
> dataset like Wikipedia, it has become difficult to load the data into
> memory (even though it takes only 600MB on disk). It shoots past 2.5GB
> when I use a HashMap<String, HashMap<String, Float>> to store the weights. I
> wish there were a big matrix server out there, so that all I had to do to
> fetch a value was call fetch(row, column).
>
>
> I am trying to put the data on HBase.
>
> Please tell me if there are simpler solutions to do this using Hadoop or
> any other package.
>
> Robin
>
