On Tue, Jun 1, 2010 at 2:21 AM, Ted Dunning <[email protected]> wrote:

> On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <[email protected]
> >wrote:
>
>
> > The general access pattern to the dataset within any algo would be to
> find
> > the lists of (users,ratings) for each movie and to find the lists of
> > (movies, ratings) for each user in a fast way. In C++, this design
> > requirement can be satisfied using an in-memory approach.
> >
>
> This pattern of access is essentially the same as a matrix multiplication.
>  IF A is the user x movie matrix, then your computation has the same
> computational shape as the multiplication A' A.
>
>
This was insightful. I wanted to get out of my current thinking of the
problem. Thanks.


> > One solution is we first design a bit access pattern such that we fit
> > (user,
> > rating) or (movie, rating) in 32 bits(Actually even the date can fit, but
> > that depends if we have a temporal model to use it :) ).
>
>
> I don't think that this is a good design decision at all.  Much better to
> keep your structures general.  Many large applications will have more than
> 20,000 items.  To use Veoh as an example again, we had tens of millions of
> users and about 10 million videos.


I will definitely change that. I was trying to explain why I used userent[]
in the code...which was because I was still thinking along those lines.
Course correction for sure. I was trying to find possible Mahout data
structures here that would be useful in replacing the current non-generic
format.



> So the first one would be like:
> >
> > User 1:
> > Movie 3, Rating 4
> > Movie 17, Rating 5
>
>
> This is just a sparse matrix.  You should use a pre-designed data structure
> for this.
>

Ok...


>

> > and vice versa. With dates if necessary.
> >
>
> Adding dates is easily done with a second matrix.  Make sure you notice
> that
> there are two kinds of sparse matrix.  You want the sequential access kind.
>  You might look to see if you can reduce your storage requirements by
> storing the equivalent of A'A instead of A' and A as you are suggesting.
>
>
This was also very useful, thanks.


>
> > A few examples of the operations -
> >
> > for (i=0; i<softmax; i++) {
> >  visbiases[j][i] =
> > Math.log(((double)moviercount[j*softmax+i])/((double)mtot));
> > }
> >
> > Here's another example -
> > CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn) -
> > weightCost * vishid[m][rr][h]);
> >
>     c = new DenseMatrix(a.rows(), b.columns())
>     for (int i = 0; i < a.rows(); i++) {
>        for (int j = 0; j < b.columns(); j++) {
>           double sum = 0;
>           for (int k = 0; k < b.columns(); k++) {
>               sum += a.get(i, k) * b.get(k, j);
>           }
>           c.put(i, j, sum);
>       }
>    }
>
> Since memory is much slower than the arithmetic in modern processes, this
> creates a memory bottleneck.  What you need to do, then is to re-use
> elements many times to avoid reading them from main memory.  To do this, you
> need to do a block decomposition of the matrix multiply and you may need to
> do two levels of block decomposition in order to reflect L1 cache and number
> of available registers
>
>
>
Thanks again for this...

...
> > So in short, the second issue I'm faced with is:
> >
> > 2. What are the distributed data structures/operations/iterators that
> would
> > be perfect for these algorithm-related data structures/operations in
> > Mahout?
> >
>
> Write down your algorithm first.  Don't just point to a paper that
> references another paper.
>
> Yes I've made notes, it's that I was focussing on individual operations
rather than the bigger picture and typographically, I thought it more
convenient to quote the paper in the mail.


> > I really think that rashly doing something is going to cause me a lot of
> > days of misspent time,
>
>
> You are correct here, but not in the way that you think.
>

:) Thanks, I hope so.

Sisir

Reply via email to