Sure, I'll discuss a bit of what I wanted to do and what operations are required.
The issue I had when I used naive Java to load up the Netflix dataset was that it simply didn't fit in memory, even on a machine with 24GB of RAM - probably due to the per-object overhead that every data type carries in Java. There's no question that the source data has to live in some distributed fashion in Mahout, as with any other algorithm. The general access pattern within any of the algorithms is to quickly find the list of (user, rating) pairs for each movie and the list of (movie, rating) pairs for each user.

In C++, this design requirement can be satisfied with an in-memory approach. One solution is to first design a bit layout such that a (user, rating) or (movie, rating) pair fits in 32 bits (actually even the date can fit, but whether that matters depends on having a temporal model to use it :) ). This does put some limits on the number of users and movies, but at least for Netflix, the 480,000-odd users and 17,770 movies can be made to fit within 32 bits. Then we generate two different arrays of 32-bit entries, one indexed by user and the other by movie. The first one looks like:

    User 1: (Movie 3, Rating 4), (Movie 17, Rating 5), ...
    User 2: (Movie 5, Rating 1), ...
    ...

and the movie-indexed array holds the same data keyed the other way around, with dates if necessary. Each of these arrays is around 300MB, so both fit into memory within about 600MB. They are precomputed once and the dataset is always accessed in this format: via mmap() they're loaded into memory, and all the algorithms (KNN, SVD, RBM) take advantage of the fast iteration and run quite fast. There's a short directory index at the beginning, which records where each user's block of entries begins in the user-indexed array and where each movie's block begins in the movie-indexed array - one entry for each of the 480,000-odd users and each of the 17,770 movies. So the cost of locating a user's or movie's entries is limited to two array lookups.

That explains the errors that show up in the current .diff implementation of a pure RBM: I wrote it against the above layout (a preconceived notion from the earlier C++-based approach), and it will stay that way until I find the right way of doing this in Mahout (right data structure, right access iterator, and so on) after discussing it on the list. Expressions like userent[base0+j] & USER_MOVIEMASK are marked "TODO: Replace" in several places. base0 is the index where a particular user's entries begin; it is updated as needed before any user- or movie-specific iteration that loads the RBM data structures. The mask separates out the rating, and a couple of different masks are used.

In the current context, I have the following problems with the above plan:

1. *Read the netflix-dataset.csv file of (user, movie, rating, date) and write it as ? in HDFS.* Should ? be a sparse matrix, a distributed row matrix, or something else? Either way, would it be necessary to convert it to something like the above C++ format? The minimum guarantee the format should provide is fast access to each user's or each movie's entries throughout the dataset. I'm not clear on which of the structures within Mahout to choose for this purpose.

The second part of the problem is that there are a ton of data structures we keep building during training of the algorithm; this is the second issue I have with regard to choosing the right Mahout-specific access methods and data structures.
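For concreteness, here is a minimal sketch (plain Java, nothing Mahout-specific) of the packed layout and directory index described above. The bit widths, constant names and the helper method are all illustrative - the actual masks in the diff (USER_MOVIEMASK etc.) may use different widths:

    final class PackedRatings {
      // Illustrative widths: 17,770 movies fit in 15 bits and ratings 1-5 in 3 bits,
      // leaving room in the remaining bits for a date offset if a temporal model needs it.
      static final int MOVIE_BITS  = 15;
      static final int MOVIE_MASK  = (1 << MOVIE_BITS) - 1;   // plays the role of USER_MOVIEMASK
      static final int RATING_BITS = 3;
      static final int RATING_MASK = ((1 << RATING_BITS) - 1) << MOVIE_BITS;

      static int pack(int movie, int rating) {
        return (rating << MOVIE_BITS) | movie;
      }

      static int movieOf(int packed)  { return packed & MOVIE_MASK; }
      static int ratingOf(int packed) { return (packed & RATING_MASK) >>> MOVIE_BITS; }

      // userIndex[u] holds the offset where user u's entries begin in userent,
      // so finding a user's ratings costs two array lookups plus a linear scan.
      static void forEachRatingOfUser(int u, int[] userIndex, int[] userent) {
        for (int j = userIndex[u]; j < userIndex[u + 1]; j++) {
          int movie  = movieOf(userent[j]);     // i.e. userent[base0 + j] & USER_MOVIEMASK
          int rating = ratingOf(userent[j]);
          // ... feed (movie, rating) into the RBM's visible softmax units ...
        }
      }
    }

The movie-indexed array is the mirror image, packing (user, rating) instead, with its own directory at the front of the file.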
But a short excursion now to the algorithm itself. This <http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf> is the algorithm that has been tested on the Netflix dataset. There is a MATLAB version of it provided by the authors for the MNIST digits dataset, and there are independent C++ versions for Netflix.

A key design idea is that there are just two layers - a visible softmax layer and a binary/Gaussian/etc. hidden layer, depending on the variant. The connectivity is also restricted in a way that allows distributed computation. Another key idea is that instead of exact gradient-based learning (which takes exponential time), we use Contrastive Divergence, which was the authors' contribution in a previous paper and has been reported to give a good balance between approximation quality and performance. The algorithm predicts a rating for a (user, movie) pair in time linear in the number of hidden units.

There are also variants: *conditional* RBMs and *factored* RBMs. The conditional variety uses the information that a given user has a movie in the test dataset at all - which does say something - and it also treats not having watched a movie as useful information. This is modelled by using different formulae at each stage of the RBM. Factoring is a way to reduce the number of parameters that go into the predictions; it may or may not be a good choice when we have a cluster to use. So the aim is to provide all of these varieties in a neatly wrapped package in Mahout.

Now, the RBM implemented in MAHOUT-375.diff is a pure RBM - no conditional or factored versions yet. But I planned on writing this one first so that I could then tackle the issue of the data input and data structures, and then rapidly iterate by refactoring and add in the other variants.

A few examples of the operations:

    for (int i = 0; i < softmax; i++) {
      visbiases[j][i] = Math.log(((double) moviercount[j*softmax + i]) / ((double) mtot));
    }

Here the biases of the visible units are being set using movie j's specific rating counts. There's another for loop just before it that sums up mtot (mtot += moviercount[j*softmax + k], with k going from 0 to softmax), and this happens a lot of times - summing over a particular slice of an array is something that is done extensively, again and again, in many places.

Here's another example:

    CDinc[m][rr][h] = Momentum * CDinc[m][rr][h]
        + EpsilonW * ((CDp - CDn) - weightCost * vishid[m][rr][h]);

This is part of the contrastive divergence learning approximation that I mentioned earlier. rr varies from 0 to softmax, h from 0 to totalFeatures, and m from 0 to numItems (480,000 in the case of the Netflix dataset - this factor alone), and it happens in every iteration. This part would benefit immensely from distributed data structures and the associated operations.

Another naive idea, of course, is that the predictions for the 2.8 million entries in the Netflix Prize qualifying set - or for any other recommender system of equal or larger size - would benefit immensely from being done in parallel. I guess this will become clear once the previous issue is figured out. So the second issue I'm faced with is:

2. What are the distributed data structures/operations/iterators in Mahout that would be perfect for these algorithm-related data structures and operations?

So, to summarize: the first issue is obtaining a fast iterator over user-indexed entries and movie-indexed entries (whether that should be a sparse matrix or a distributed row matrix is the doubt), and the second issue is how to go about parallelizing the above operations within the algorithm (once the user iterator, movie iterator, and format are cleared up) - operations which are terribly inefficient in a serial form, and one of the primary issues with any deep belief net.
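To make the second issue a bit more concrete: one way to restate that CDinc update (a sketch only, not what the diff currently does) is to make the per-user work explicit - each user contributes independent positive/negative statistics, those get summed across users (which is the natural map/reduce split), and only the final fold into CDinc and vishid touches the full weight tensor. The names CDpos, CDneg and usersInBatch are hypothetical, the division by the batch size is an assumption about how CDp/CDn are normalized, and the final weight step is assumed too:

    // Sketch only: CDpos/CDneg (summed over users) and usersInBatch are hypothetical;
    // CDinc, vishid and the constants mirror the ones used in the diff.
    static void applyCdUpdate(double[][][] CDinc, double[][][] vishid,
                              double[][][] CDpos, double[][][] CDneg,
                              int usersInBatch,
                              double momentum, double epsilonW, double weightCost) {
      for (int m = 0; m < vishid.length; m++) {              // m: items
        for (int rr = 0; rr < vishid[m].length; rr++) {      // rr: softmax ratings
          for (int h = 0; h < vishid[m][rr].length; h++) {   // h: hidden features
            double cd = (CDpos[m][rr][h] - CDneg[m][rr][h]) / usersInBatch;
            CDinc[m][rr][h] = momentum * CDinc[m][rr][h]
                + epsilonW * (cd - weightCost * vishid[m][rr][h]);
            vishid[m][rr][h] += CDinc[m][rr][h];             // assumed weight step
          }
        }
      }
    }

Only the accumulation of CDpos/CDneg needs the per-user data; the triple loop itself only touches the weights, so the expensive part to distribute is the per-user accumulation and the cross-user summation.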
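And on the prediction side, this is roughly the computation behind the "linear in the number of hidden units" statement - a self-contained sketch of the prediction pass from the paper, not code from the diff. The array names follow the diff's conventions (vishid[m][rr][h], visbiases, hidbiases), and ratings are assumed to be stored as 0-based indices:

    // Sketch of the prediction pass from the paper; not code from the diff.
    static double predictRating(int queryMovie,
                                int[] userMovies, int[] userRatings,  // one user's observed data
                                double[][][] vishid,                  // [movie][rating][hidden]
                                double[][] visbiases,                 // [movie][rating]
                                double[] hidbiases,
                                int softmax, int totalFeatures) {
      // 1. Hidden probabilities p(h = 1 | this user's observed softmax units):
      //    one pass over the user's movies per hidden unit.
      double[] hiddenProb = new double[totalFeatures];
      for (int h = 0; h < totalFeatures; h++) {
        double act = hidbiases[h];
        for (int n = 0; n < userMovies.length; n++) {
          act += vishid[userMovies[n]][userRatings[n]][h];
        }
        hiddenProb[h] = 1.0 / (1.0 + Math.exp(-act));
      }
      // 2. Reconstruct the softmax distribution over the query movie's ratings
      //    and return its expectation as the predicted rating.
      double[] score = new double[softmax];
      double norm = 0.0;
      for (int k = 0; k < softmax; k++) {
        double act = visbiases[queryMovie][k];
        for (int h = 0; h < totalFeatures; h++) {
          act += hiddenProb[h] * vishid[queryMovie][k][h];
        }
        score[k] = Math.exp(act);
        norm += score[k];
      }
      double expected = 0.0;
      for (int k = 0; k < softmax; k++) {
        expected += (k + 1) * (score[k] / norm);   // rating scale 1..softmax
      }
      return expected;
    }

Since this only reads the trained weights plus one user's ratings, each of the 2.8 million qualifying entries can be scored completely independently, which is what makes the prediction stage embarrassingly parallel once the data format question is settled.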
I really think that rashly doing something is going to cost me a lot of days of misspent time, which is why I'm trying to figure this out very clearly on the list, with your advice, before going ahead. Please do mention anything you find relevant - it would be immensely useful to me. Thanks, really, for the response so far - it's very encouraging.

Sisir
