On Tue, Jun 1, 2010 at 2:21 AM, Ted Dunning <[email protected]> wrote:
> On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <[email protected] > >wrote: > > > > The general access pattern to the dataset within any algo would be to > find > > the lists of (users,ratings) for each movie and to find the lists of > > (movies, ratings) for each user in a fast way. In C++, this design > > requirement can be satisfied using an in-memory approach. > > > > This pattern of access is essentially the same as a matrix multiplication. > IF A is the user x movie matrix, then your computation has the same > computational shape as the multiplication A' A. > > This was insightful. I wanted to get out of my current thinking of the problem. Thanks. > > One solution is we first design a bit access pattern such that we fit > > (user, > > rating) or (movie, rating) in 32 bits(Actually even the date can fit, but > > that depends if we have a temporal model to use it :) ). > > > I don't think that this is a good design decision at all. Much better to > keep your structures general. Many large applications will have more than > 20,000 items. To use Veoh as an example again, we had tens of millions of > users and about 10 million videos. I will definitely change that. I was trying to explain why I used userent[] in the code...which was because I was still thinking along those lines. Course correction for sure. I was trying to find possible Mahout data structures here that would be useful in replacing the current non-generic format. > So the first one would be like: > > > > User 1: > > Movie 3, Rating 4 > > Movie 17, Rating 5 > > > This is just a sparse matrix. You should use a pre-designed data structure > for this. > Ok... > > > and vice versa. With dates if necessary. > > > > Adding dates is easily done with a second matrix. Make sure you notice > that > there are two kinds of sparse matrix. You want the sequential access kind. > You might look to see if you can reduce your storage requirements by > storing the equivalent of A'A instead of A' and A as you are suggesting. > > This was also very useful, thanks. > > > A few examples of the operations - > > > > for (i=0; i<softmax; i++) { > > visbiases[j][i] = > > Math.log(((double)moviercount[j*softmax+i])/((double)mtot)); > > } > > > > Here's another example - > > CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn) - > > weightCost * vishid[m][rr][h]); > > > c = new DenseMatrix(a.rows(), b.columns()) > for (int i = 0; i < a.rows(); i++) { > for (int j = 0; j < b.columns(); j++) { > double sum = 0; > for (int k = 0; k < b.columns(); k++) { > sum += a.get(i, k) * b.get(k, j); > } > c.put(i, j, sum); > } > } > > Since memory is much slower than the arithmetic in modern processes, this > creates a memory bottleneck. What you need to do, then is to re-use > elements many times to avoid reading them from main memory. To do this, you > need to do a block decomposition of the matrix multiply and you may need to > do two levels of block decomposition in order to reflect L1 cache and number > of available registers > > > Thanks again for this... ... > > So in short, the second issue I'm faced with is: > > > > 2. What are the distributed data structures/operations/iterators that > would > > be perfect for these algorithm-related data structures/operations in > > Mahout? > > > > Write down your algorithm first. Don't just point to a paper that > references another paper. > > Yes I've made notes, it's that I was focussing on individual operations rather than the bigger picture and typographically, I thought it more convenient to quote the paper in the mail. > > I really think that rashly doing something is going to cause me a lot of > > days of misspent time, > > > You are correct here, but not in the way that you think. > :) Thanks, I hope so. Sisir
