Re: Loading Huge Models for classification

Robin Anil Sun, 13 Jul 2008 09:30:44 -0700

On Sun, Jul 13, 2008 at 9:26 PM, Sean Owen <[EMAIL PROTECTED]> wrote:


> Yeah there is a lot of overhead involved in doing it this way: you
> have several HashMaps, but also, each entry involves a HashMap.Entry
> object. Object overhead on a 64-bit machine is at least 12 bytes. (8
> byte reference + 4 byte object overhead)
>
> I can recommend a few possible strategies:
> - Those secondary HashMaps -- don't use the default capacity of 16. Set it
> to something like 1 or 2 to start. You will spend more time growing
> them but waste less memory on maps that don't have more than a few
> entries

I did that, and the same amount of memory is used. That just means passing 1
as the initial capacity in the constructor, right?
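
A minimal sketch of the capacity change being discussed, assuming the outer key is the feature and the inner key is the label. The class name WeightTable and its methods are hypothetical, not code from the classifier. The capacity argument only sizes the bucket array; every entry still needs its own HashMap.Entry object and a boxed Float, so the saving from this change alone may be small.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the two-level weight table (names are assumptions).
final class WeightTable {
    // Outer map: feature -> (label -> weight).
    private final Map<String, Map<String, Float>> weights = new HashMap<>();

    void put(String feature, String label, float weight) {
        Map<String, Float> inner = weights.get(feature);
        if (inner == null) {
            // Initial capacity of 1 instead of the default; the map still grows on demand.
            inner = new HashMap<>(1);
            weights.put(feature, inner);
        }
        inner.put(label, weight); // the float is boxed into a Float here
    }

    float get(String feature, String label, float defaultValue) {
        Map<String, Float> inner = weights.get(feature);
        if (inner == null) {
            return defaultValue;
        }
        Float w = inner.get(label);
        return w == null ? defaultValue : w;
    }
}
```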

>
> - It *might* be possible that a single map with a key being a pair of
> Strings is more memory efficient here. It depends on how keys are
> distributed.
> - Does the secondary map need to be a map? If it has just a handful of
> keys, a linear search isn't difficult, and you could just map to two
> parallel arrays of String and float

Some features occur only in certain labels. There will be many features
that are unique to one label or to two labels (it depends on the data). Would
the size be lower if the parallel arrays were ArrayLists?
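
A rough sketch of the parallel-array idea for a secondary map with only a handful of keys. The class name SmallWeights and the starting length of 2 are assumptions for illustration. On the ArrayList question: an ArrayList<Float> boxes every value into a Float object, so it would generally take more memory than a plain float[].

```java
import java.util.Arrays;

// Hypothetical stand-in for a small secondary map: two parallel arrays.
final class SmallWeights {
    private String[] keys = new String[2];
    private float[] values = new float[2];
    private int size = 0;

    void put(String key, float value) {
        // Linear search: overwrite the value if the key is already present.
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        // Grow both arrays together so that index i always refers to one pair.
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    float get(String key, float defaultValue) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return defaultValue;
    }

    int size() {
        return size;
    }
}
```

Lookups cost time proportional to the number of keys, so this only pays off while each secondary map stays small.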

>
> - Do you absolutely need String keys or would Long or Integer do for
> your purposes
>
I am keeping a mapping of String keys to integers. I tried using
Map<Integer, Map<Integer, String>>. The memory use is no different, but it
takes a few seconds longer because of the lookup needed for every String
encountered.
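
One way such a String-to-integer mapping might look, as a sketch only. The class name StringIndex and its methods are hypothetical. Each distinct String is stored once and receives a dense id; the extra lookup per String is the cost mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary that assigns dense integer ids to Strings.
final class StringIndex {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Returns the existing id for s, or assigns the next free id.
    int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = names.size();
            ids.put(s, id);
            names.add(s);
        }
        return id;
    }

    // Reverse lookup: id back to the original String.
    String nameOf(int id) {
        return names.get(id);
    }

    int size() {
        return names.size();
    }
}
```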

>
> Finally from struggling with a very similar issue in my code I ended
> up writing a class called
> org.apache.mahout.cf.taste.impl.common.FastMap which uses linear
> probing instead of separate chaining to implement a hashtable. So, you
> don't have an extra object per entry, though you need more hash
> buckets to get reasonable performance. For Maps that are not updated
> constantly, I find it's more memory efficient for sure and a little
> faster.
>
> As a very last resort, look for custom Map implementations out there
> (or write your own) that map to a float (primitive) value rather than an
> Object (Float). Saving even that Object overhead itself would be
> great.
>
> Or you could modify FastMap very easily too to use floats -- it's icky
> to copy-and-paste code but the savings at scale may be so compelling
> that it's the right thing to do.
>
I will try it out with FastMap<String, FastMap<Integer, Float>> and post the
results.
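
To illustrate the linear-probing idea with primitive float values, here is a small sketch. It is not FastMap's actual code: the class name IntFloatMap, the non-negative-key restriction, the 0.5 load limit and the hash mixing are all assumptions, and removal is left out to keep it short.

```java
import java.util.Arrays;

// Hypothetical open-addressing map from non-negative int keys to float values.
final class IntFloatMap {
    private static final int EMPTY = -1; // marks a free slot; keys must be >= 0
    private int[] keys;
    private float[] values;
    private int size;

    IntFloatMap(int expectedSize) {
        int capacity = 4;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // power of two, at most half full
        }
        keys = new int[capacity];
        values = new float[capacity];
        Arrays.fill(keys, EMPTY);
    }

    // Linear probing: step to the next slot until the key or a free slot is found.
    private int slot(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;
        int i = (h ^ (h >>> 16)) & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    void put(int key, float value) {
        if (key < 0) {
            throw new IllegalArgumentException("key must be non-negative");
        }
        if ((size + 1) * 2 > keys.length) {
            grow();
        }
        int i = slot(key);
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value; // stored as a primitive, no Float or Entry object
    }

    float get(int key, float defaultValue) {
        if (key < 0) {
            return defaultValue;
        }
        int i = slot(key);
        return keys[i] == key ? values[i] : defaultValue;
    }

    int size() {
        return size;
    }

    // Doubles the table and reinserts every entry.
    private void grow() {
        int[] oldKeys = keys;
        float[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new float[keys.length];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }
}
```

The table keeps spare slots rather than allocating an object per entry, which is the trade-off described in the quoted message: more buckets, but no extra object for each mapping.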


>
>
> On Sun, Jul 13, 2008 at 9:31 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
> > In my classification code, I create the model easily using MapReduce,
> > but it has become difficult to do classification with big datasets. For
> > a big dataset like Wikipedia it has become difficult to load the data
> > into memory (even though it takes only 600MB on disk). It shoots past
> > 2.5GB when I use a HashMap<String, HashMap<String, Float>> to store the
> > weights. I wish there were a big matrix server out there, so that all I
> > had to do to fetch a value was call fetch(row, column).
> >
> >
> > I am trying to put the data on HBase.
> >
> > Please tell me if there are simpler solutions for this using Hadoop or
> > any other package.
> >
> > Robin
> >
>


Robin
