Hey Sisir,

Some questions to hopefully spark some of your thinking:
If you try to step into the mindset of map/reduce steps, can you outline what the algorithm is doing during the steps a) updating the input weights, b) updating the hidden weights, and c) contrastive divergence?

If you put all of the ratings of a particular user into a Vector (which structurally looks like a Map<Integer,Double>, but is more efficient and has lots of mathematical methods available on it), with the keys being the movieIds the user has rated and the values being their ratings, then your entire data set could live on HDFS as a DistributedRowMatrix. Similarly, if you take the transpose() of that matrix, you get back an HDFS file whose rows are Vector instances of the ratings for a given movie, with the entries keyed on the userIds of the users who rated it.

The algorithm for training in the Hinton paper you linked to seems to allow for a pretty nice "per-user" parallelization (reading section 2.2, in particular): each user gets "their own" separate RBM, with the constraint that after every iteration you need to average together all of the connection weights (and biases) across all of the users, and use this average to initialize the next iteration. In particular, it looks like what you do is distribute the connection coefficient matrix (well, matrices, because for the discrete weightings case that you and Hinton are considering, you actually have one set of weights per rating, 1,...,5) via DistributedCache to all of the Hadoop nodes. This is mostly fine: the size of this data is 5 * numItems (let's not get too caught up in them being movies ;) ) * numHiddenNodes * 8 bytes, so mappers would be able to fit this all in memory for 100 hidden units and up to about 100K movies even with small-memory-footprint Mappers, and you could probably scale up to 10x that if you max out the memory allowable in the Mapper JVMs. To scale further, we would need to do further parallelization.

But ok, so you distribute the "current" set of biases and connections to all of your mappers and load them into memory at the start of everything. Then the Mappers iterate over one user at a time (getting all of their ratings at once); you iterate over each of their ratings, updating a copy of a portion of the connections-and-biases matrix (just the part which corresponds to the items the user has rated) using CD, and then you *write the updated values to HDFS* (this is where some cleverness should probably come in: you want to write out key/value pairs, keyed so that the same key gets multiple values, so you can do the averaging easily in parallel too). Probably keying on the index of the hidden units might work: send any updated weights and biases for hidden unit "j" out as (j, deltaW_j), where deltaW_j is really a matrix of values deltaW_ijk, the deltas for the connections between hidden unit j and rating k of item i. The reduce step would then be really just a sum and average, and after that map/reduce pass you need to redistribute the new W matrix out to all the nodes again for the next iteration.

I'm not an RBM expert, so I'm not sure I have the way the training on these things works exactly right, but it certainly seems like the operation looks a lot like Ted was saying: a matrix multiplication of the observed rating matrix times the weights matrix (plus the bias, and then the logistic function applied to each of the values of the output). It's a little trickier than normal matrix multiplication, because the matrix in question has this extra index of "which rating" was given, but that could get folded into the inner map function.
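To make the key/value averaging part above a little more concrete, here's a rough, completely untested sketch of what the reduce side could look like, using Hadoop's Reducer and Mahout's MatrixWritable. The class name and the choice of writables are just placeholders I made up, so treat this as the shape of the idea rather than working code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.MatrixWritable;

// Key = index j of a hidden unit; value = one mapper's partial update
// deltaW_j, stored as a numItems x 5 matrix of weight deltas for unit j.
// The reducer just averages the partial updates element-wise.
public class WeightAveragingReducer
    extends Reducer<IntWritable, MatrixWritable, IntWritable, MatrixWritable> {

  @Override
  protected void reduce(IntWritable hiddenUnit, Iterable<MatrixWritable> deltas,
                        Context ctx) throws IOException, InterruptedException {
    Matrix sum = null;
    int n = 0;
    for (MatrixWritable delta : deltas) {
      // accumulate all the updates emitted for this hidden unit
      sum = (sum == null) ? delta.get().clone() : sum.plus(delta.get());
      n++;
    }
    if (sum != null) {
      // emit the average; the driver would push this back out (e.g. via
      // DistributedCache) as the starting W for the next iteration
      ctx.write(hiddenUnit, new MatrixWritable(sum.divide(n)));
    }
  }
}

The map side would emit one (j, deltaW_j) pair per hidden unit after running CD over a single user's ratings, and the driver redistributes the averaged matrices to all the nodes for the next pass.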
Essentially, you could do a map-side join of the visible ratings matrix and the weights matrix, joined on the key they share in common (movies, i.e. items), and compute, for each movie i and hidden unit j:

X_ij = Sum_{k=1..5} v_ik * W_ijk

then emit key/value pairs as (j, X_ij). The reducer then sums up all of the different X_ij for the different movies i, adds on the bias b_j (which will have to have been loaded into memory), applies the logistic function, and you've got the current guess as to p(h_j = 1 | V). This technique will require more map/reduce steps than the one above, but will scale better (no memory limitations).
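In very rough Java (untested, and the method names, the per-movie weight slice wI, and the "for one user" framing are all just my own placeholders; Mahout's Matrix/Vector types do the arithmetic), those two steps could look something like:

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Sketch of the per-movie map step and the per-hidden-unit reduce step for
// computing p(h_j = 1 | V) for one user's visible ratings V.
public final class HiddenUnitActivation {

  // Map side, for one movie i the user rated: vI is the length-5 indicator
  // vector of the observed rating (v_ik = 1 if the rating was k), and wI is
  // the numHiddenNodes x 5 slice of the weights for movie i. The j-th entry
  // of the result is X_ij; each entry would be emitted as a (j, X_ij) pair.
  static Vector mapMovie(Vector vI, Matrix wI) {
    return wI.times(vI);  // X_ij = Sum_{k=1..5} W_ijk * v_ik
  }

  // Reduce side, for one hidden unit j: sum the X_ij over all movies i the
  // user rated, add the bias b_j, and apply the logistic function.
  static double reduceHiddenUnit(Iterable<Double> partialSums, double biasJ) {
    double sum = biasJ;
    for (double x : partialSums) {
      sum += x;
    }
    return 1.0 / (1.0 + Math.exp(-sum));  // current guess at p(h_j = 1 | V)
  }
}

The nice part is that neither the full ratings matrix nor the full weight tensor ever has to fit in any one mapper's memory.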
So I'm not sure if any of this was helpful, but I'm just trying to brainstorm along with you on ways you can distribute this problem such that you're not stuck calculating how much fits in memory. Let me know which parts of this make sense, and which parts are the ramblings of someone who doesn't understand that much about neural nets in general, let alone RBMs (but I'd like to learn more!).

  -jake

On Mon, May 31, 2010 at 2:32 PM, Sisir Koppaka <[email protected]> wrote:

> On Tue, Jun 1, 2010 at 2:21 AM, Ted Dunning <[email protected]> wrote:
>
> > On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <[email protected]>
> > wrote:
> >
> > > The general access pattern to the dataset within any algo would be to
> > > find the lists of (users, ratings) for each movie and to find the lists
> > > of (movies, ratings) for each user in a fast way. In C++, this design
> > > requirement can be satisfied using an in-memory approach.
> >
> > This pattern of access is essentially the same as a matrix multiplication.
> > If A is the user x movie matrix, then your computation has the same
> > computational shape as the multiplication A' A.
>
> This was insightful. I wanted to get out of my current thinking of the
> problem. Thanks.
>
> > > One solution is we first design a bit access pattern such that we fit
> > > (user, rating) or (movie, rating) in 32 bits (actually even the date can
> > > fit, but that depends on whether we have a temporal model to use it :) ).
> >
> > I don't think that this is a good design decision at all. Much better to
> > keep your structures general. Many large applications will have more than
> > 20,000 items. To use Veoh as an example again, we had tens of millions of
> > users and about 10 million videos.
>
> I will definitely change that. I was trying to explain why I used userent[]
> in the code... which was because I was still thinking along those lines.
> Course correction for sure. I was trying to find possible Mahout data
> structures here that would be useful in replacing the current non-generic
> format.
>
> > > So the first one would be like:
> > >
> > > User 1:
> > > Movie 3, Rating 4
> > > Movie 17, Rating 5
> >
> > This is just a sparse matrix. You should use a pre-designed data structure
> > for this.
>
> Ok...
>
> > > and vice versa. With dates if necessary.
> >
> > Adding dates is easily done with a second matrix. Make sure you notice
> > that there are two kinds of sparse matrix. You want the sequential access
> > kind. You might look to see if you can reduce your storage requirements by
> > storing the equivalent of A'A instead of A' and A as you are suggesting.
>
> This was also very useful, thanks.
>
> > > A few examples of the operations -
> > >
> > > for (i = 0; i < softmax; i++) {
> > >   visbiases[j][i] =
> > >       Math.log(((double) moviercount[j * softmax + i]) / ((double) mtot));
> > > }
> > >
> > > Here's another example -
> > >
> > > CDinc[m][rr][h] = Momentum * CDinc[m][rr][h]
> > >     + EpsilonW * ((CDp - CDn) - weightCost * vishid[m][rr][h]);
> >
> > // naive triple loop computing c = a * b
> > Matrix c = new DenseMatrix(a.rows(), b.columns());
> > for (int i = 0; i < a.rows(); i++) {
> >   for (int j = 0; j < b.columns(); j++) {
> >     double sum = 0;
> >     for (int k = 0; k < a.columns(); k++) {
> >       sum += a.get(i, k) * b.get(k, j);
> >     }
> >     c.put(i, j, sum);
> >   }
> > }
> >
> > Since memory is much slower than the arithmetic in modern processors, this
> > creates a memory bottleneck. What you need to do, then, is to re-use
> > elements many times to avoid reading them from main memory. To do this,
> > you need to do a block decomposition of the matrix multiply, and you may
> > need to do two levels of block decomposition in order to reflect the L1
> > cache and the number of available registers.
>
> Thanks again for this...
>
> ...
>
> > > So in short, the second issue I'm faced with is:
> > >
> > > 2. What are the distributed data structures/operations/iterators that
> > > would be perfect for these algorithm-related data structures/operations
> > > in Mahout?
> >
> > Write down your algorithm first. Don't just point to a paper that
> > references another paper.
>
> Yes, I've made notes; it's just that I was focusing on individual operations
> rather than the bigger picture, and typographically I thought it more
> convenient to quote the paper in the mail.
>
> > > I really think that rashly doing something is going to cause me a lot of
> > > days of misspent time,
> >
> > You are correct here, but not in the way that you think.
>
> :) Thanks, I hope so.
>
> Sisir
>
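P.S. On the block decomposition Ted mentions above: purely as an illustration (untested, a single level of blocking only, block size left as a knob, using Mahout's Matrix API), the naive triple loop reshaped into blocks looks roughly like this:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

// One level of block decomposition of c = a * b: work on small tiles so the
// entries of a and b being touched stay resident in cache and get re-used.
public final class BlockedMultiply {

  static Matrix multiply(Matrix a, Matrix b, int blockSize) {
    Matrix c = new DenseMatrix(a.rowSize(), b.columnSize());
    for (int i0 = 0; i0 < a.rowSize(); i0 += blockSize) {
      for (int j0 = 0; j0 < b.columnSize(); j0 += blockSize) {
        for (int k0 = 0; k0 < a.columnSize(); k0 += blockSize) {
          int iMax = Math.min(i0 + blockSize, a.rowSize());
          int jMax = Math.min(j0 + blockSize, b.columnSize());
          int kMax = Math.min(k0 + blockSize, a.columnSize());
          // accumulate the partial products for one tile of c
          for (int i = i0; i < iMax; i++) {
            for (int j = j0; j < jMax; j++) {
              double sum = c.get(i, j);
              for (int k = k0; k < kMax; k++) {
                sum += a.get(i, k) * b.get(k, j);
              }
              c.set(i, j, sum);
            }
          }
        }
      }
    }
    return c;
  }
}

A second level of blocking inside the innermost tile would be the register-level version Ted is talking about.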
