Hi,
Just an update. I incorporated most of the things that were discussed: I
used the mahout.common SparseMatrix implementation, changed the SparseVector
HashMap initial capacity to 1, changed most of the Maps into arrays, and
used primitives as far as possible.

The only Map that I still have is a Map<String, Integer>, which maps each
feature to its featureId.

The only gripe I have is that I couldn't iterate through a SparseVector row
to get the columnId along with the cell value; iterator just returns an
Iterable<Element>. Shouldn't that be there? As it is, I had to iterate over
the entire number of columns and keep checking whether the value returned
is zero. (I could save a lot of cycles by iterating over only the nonzero
cells.)
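
To make the non-zero iteration described above concrete, here is a minimal
sketch. It is not Mahout's SparseVector; SparseRow, CellVisitor and
forEachNonZero are hypothetical names used only for illustration. It assumes
a row stored as two parallel arrays (sorted column ids and their values), so
a visitor sees each columnId together with its cell value and never touches
the zero cells.

```java
import java.util.Arrays;

/** Hypothetical sparse row backed by parallel primitive arrays (not Mahout's API). */
class SparseRow {
    /** Callback that receives each stored cell as (columnId, value). */
    public interface CellVisitor {
        void visit(int column, double value);
    }

    private int[] columns = new int[1];      // sorted column ids of stored cells
    private double[] values = new double[1]; // values[i] belongs to columns[i]
    private int size = 0;                    // number of stored cells

    /** Sets a cell; an explicitly stored 0.0 is kept as a stored cell in this sketch. */
    public void set(int column, double value) {
        int pos = Arrays.binarySearch(columns, 0, size, column);
        if (pos >= 0) {
            values[pos] = value;             // column already present: overwrite
            return;
        }
        int insertAt = -pos - 1;             // binarySearch encodes the insertion point
        if (size == columns.length) {        // grow both arrays together
            columns = Arrays.copyOf(columns, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        // Shift the tail right by one slot to keep the columns sorted.
        System.arraycopy(columns, insertAt, columns, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        columns[insertAt] = column;
        values[insertAt] = value;
        size++;
    }

    /** Returns the stored value, or 0.0 for a column that was never set. */
    public double get(int column) {
        int pos = Arrays.binarySearch(columns, 0, size, column);
        return pos >= 0 ? values[pos] : 0.0;
    }

    /** Visits only the stored cells, in increasing column order. */
    public void forEachNonZero(CellVisitor visitor) {
        for (int i = 0; i < size; i++) {
            visitor.visit(columns[i], values[i]);
        }
    }
}
```

With this layout the cost of scanning a row depends on the number of stored
cells rather than on the total number of columns, which is the saving
mentioned above.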

20MB model of 20Newsgroups takes 82MB RAM.
600MB wikipedia model takes 2.4GB RAM.

The classification testing of the Wikipedia dataset (2.2GB) is going at a
rate of about 100MB per hour (34K documents per hour, or roughly 567
documents per minute).

I am putting down what I think would be a good way to scale. I don't know
how much of it can be done by me, but here goes:

1)  Keep a Classification Server which exposes a method to classify an input
document. (It will hold the features and the label-to-weight mapping.)
2)  If memory is limited, the rest of the features can be loaded on another
server. (Swapping in and out seems slower than network latency, plus I
couldn't go beyond 2600MB of heap space on a 4GB 32-bit system.)
3)  A Map/Reduce job can take each document, classify it, and then output
the most probable class for each doc_id. (I don't know how a distributed
Classification Server helps here; the network will probably be the
bottleneck.)
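
As a rough illustration of point 1, the sketch below shows what the classify
method of such a server could look like. Everything in it is an assumption
made for illustration: the class name LinearClassifierSketch, the
weights[label][featureId] layout, and scoring by summing per-label weights
are not taken from the Mahout code. It only shows the shape of the idea: one
Map<String, Integer> from feature to featureId, primitive arrays for the
weights, and the most probable label returned for a document.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory classifier; names and scoring rule are assumptions. */
class LinearClassifierSketch {
    private final Map<String, Integer> featureIds = new HashMap<>();
    private final String[] labels;
    private final double[][] weights; // weights[labelIndex][featureId]

    LinearClassifierSketch(String[] labels, String[] features, double[][] weights) {
        this.labels = labels;
        this.weights = weights;
        for (int id = 0; id < features.length; id++) {
            featureIds.put(features[id], id); // feature string -> dense featureId
        }
    }

    /** Returns the label with the highest summed weight over the known tokens. */
    String classify(String[] tokens) {
        double[] scores = new double[labels.length];
        for (String token : tokens) {
            Integer id = featureIds.get(token);
            if (id == null) {
                continue; // unseen feature: ignored in this sketch
            }
            for (int l = 0; l < labels.length; l++) {
                scores[l] += weights[l][id];
            }
        }
        int best = 0;
        for (int l = 1; l < labels.length; l++) {
            if (scores[l] > scores[best]) {
                best = l;
            }
        }
        return labels[best];
    }
}
```

A Map/Reduce mapper as in point 3 could call classify once per document and
emit the doc_id with the returned label.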

Robin

On Mon, Jul 14, 2008 at 5:01 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> This sort of operation should not be done using maps per se.  It should be
> done using a sparse matrix implementation (which does, of course, have
> operations similar to a map)
>
> The major difference is that the overhead of storing elements must be much,
> much lower than the overhead in storing elements in nested maps.
>
> For instance, floating point values should be stored in large dense arrays
> of floats or doubles (preferably doubles).  The indexing operations that
> are used to find the particular elements can be done in a number of ways,
> but one easy way is to keep an array of variable sized arrays of integers
> each of which corresponds to the locations of values in the array of
> floats.  The overhead of using generic collections usually runs about
> 40-100 bytes per entry.  The overhead of a specialized data structure can
> be as little as 10 bytes or so in a simple implementation and possibly
> somewhat smaller for specialized implementations.  This means that sparse
> matrices can be within 2-4x the size of a dense matrix that has the same
> number of elements as the sparse matrix has non-zero elements.
>
> Mahout contains some good, though very basic, matrix code that implements
> these sorts of strategies.  See, for instance, SparseRowMatrix.
>
> On Sun, Jul 13, 2008 at 9:39 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
>
> > Nope no difference with FastMap<K,V> i guess i will have to try with the
> > primitive
> >
>
>
>
> --
> ted
>
