Sure, I'll discuss a bit of what I wanted to do and what operations are required.
The issue I had when I used naive Java to load up the Netflix dataset was that it simply didn't fit in memory, even on a machine with 24GB of RAM - probably due to the per-object overhead that every data type carries in Java. There's no question that the source data has to live in some distributed fashion in Mahout, as with any other algorithm. The general access pattern within any of the algorithms is to quickly find the list of (user, rating) pairs for each movie and the list of (movie, rating) pairs for each user.

In C++, this design requirement can be satisfied with an in-memory approach. One solution is to first design a bit layout such that a (user, rating) or (movie, rating) pair fits in 32 bits (actually even the date can fit, but whether that matters depends on having a temporal model to use it :) ). This does put some limits on the number of users and movies, but at least for Netflix, the 480,000-odd users and 17,770 movies can be made to fit within 32 bits. Then we generate two different arrays of 32-bit entries, one indexed by user and the other by movie. The first one looks like:

    User 1: (Movie 3, Rating 4), (Movie 17, Rating 5), ...
    User 2: (Movie 5, Rating 1), ...
    ...

and the movie-indexed array holds the same data keyed the other way around, with dates if necessary. Each of these arrays is around 300MB, so both fit into memory within about 600MB. They are precomputed once and the dataset is always accessed in this format: via mmap() they're loaded into memory, and all the algorithms (KNN, SVD, RBM) take advantage of the fast iteration and run quite fast. There's a short directory index at the beginning, which records where each user's block of entries begins in the user-indexed array and where each movie's block begins in the movie-indexed array - one entry for each of the 480,000-odd users and each of the 17,770 movies. So the cost of locating a user's or movie's entries is limited to two array lookups.

That explains the errors that show up in the current .diff implementation of a pure RBM: I wrote it against the above layout (a preconceived notion from the earlier C++-based approach), and it will stay that way until I find the right way of doing this in Mahout (right data structure, right access iterator, and so on) after discussing it on the list. Expressions like userent[base0+j] & USER_MOVIEMASK are marked "TODO: Replace" in several places. base0 is the index where a particular user's entries begin; it is updated as needed before any user- or movie-specific iteration that loads the RBM data structures. The mask separates out the rating, and a couple of different masks are used.

In the current context, I have the following problems with the above plan:

1. *Read the netflix-dataset.csv file of (user, movie, rating, date) and write it as ? in HDFS.* Should ? be a sparse matrix, a distributed row matrix, or something else? Either way, would it be necessary to convert it to something like the above C++ format? The minimum guarantee the format should provide is fast access to each user's or each movie's entries throughout the dataset. I'm not clear on which of the structures within Mahout to choose for this purpose.

The second part of the problem is that there are a ton of data structures we keep building during training of the algorithm; this is the second issue I have with regard to choosing the right Mahout-specific access methods and data structures.
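For concreteness, here is a minimal sketch (plain Java, nothing Mahout-specific) of the packed layout and directory index described above. The bit widths, constant names and the helper method are all illustrative - the actual masks in the diff (USER_MOVIEMASK etc.) may use different widths:

    final class PackedRatings {
      // Illustrative widths: 17,770 movies fit in 15 bits and ratings 1-5 in 3 bits,
      // leaving room in the remaining bits for a date offset if a temporal model needs it.
      static final int MOVIE_BITS  = 15;
      static final int MOVIE_MASK  = (1 << MOVIE_BITS) - 1;   // plays the role of USER_MOVIEMASK
      static final int RATING_BITS = 3;
      static final int RATING_MASK = ((1 << RATING_BITS) - 1) << MOVIE_BITS;

      static int pack(int movie, int rating) {
        return (rating << MOVIE_BITS) | movie;
      }

      static int movieOf(int packed)  { return packed & MOVIE_MASK; }
      static int ratingOf(int packed) { return (packed & RATING_MASK) >>> MOVIE_BITS; }

      // userIndex[u] holds the offset where user u's entries begin in userent,
      // so finding a user's ratings costs two array lookups plus a linear scan.
      static void forEachRatingOfUser(int u, int[] userIndex, int[] userent) {
        for (int j = userIndex[u]; j < userIndex[u + 1]; j++) {
          int movie  = movieOf(userent[j]);     // i.e. userent[base0 + j] & USER_MOVIEMASK
          int rating = ratingOf(userent[j]);
          // ... feed (movie, rating) into the RBM's visible softmax units ...
        }
      }
    }

The movie-indexed array is the mirror image, packing (user, rating) instead, with its own directory at the front of the file.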
But a short excursion now to the algorithm itself. This <http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf> is the algorithm that has been tested on the Netflix dataset. There is a MATLAB version of it provided by the authors for the MNIST digits dataset, and there are independent C++ versions for Netflix.

A key design idea is that there are just two layers - a visible softmax layer and a binary/Gaussian/etc. hidden layer, depending on the variant. The connectivity is also restricted in a way that allows distributed computation. Another key idea is that instead of exact gradient-based learning (which takes exponential time), we use Contrastive Divergence, which was the authors' contribution in a previous paper and has been reported to give a good balance between approximation quality and performance. The algorithm predicts a rating for a (user, movie) pair in time linear in the number of hidden units.

There are also variants: *conditional* RBMs and *factored* RBMs. The conditional variety uses the information that a given user has a movie in the test dataset at all - which does say something - and it also treats not having watched a movie as useful information. This is modelled by using different formulae at each stage of the RBM. Factoring is a way to reduce the number of parameters that go into the predictions; it may or may not be a good choice when we have a cluster to use. So the aim is to provide all of these varieties in a neatly wrapped package in Mahout.

Now, the RBM implemented in MAHOUT-375.diff is a pure RBM - no conditional or factored versions yet. But I planned on writing this one first so that I could then tackle the issue of the data input and data structures, and then rapidly iterate by refactoring and add in the other variants.

A few examples of the operations:

    for (int i = 0; i < softmax; i++) {
      visbiases[j][i] = Math.log(((double) moviercount[j*softmax + i]) / ((double) mtot));
    }

Here the biases of the visible units are being set using movie j's specific rating counts. There's another for loop just before it that sums up mtot (mtot += moviercount[j*softmax + k], with k going from 0 to softmax), and this happens a lot of times - summing over a particular slice of an array is something that is done extensively, again and again, in many places.

Here's another example:

    CDinc[m][rr][h] = Momentum * CDinc[m][rr][h]
        + EpsilonW * ((CDp - CDn) - weightCost * vishid[m][rr][h]);

This is part of the contrastive divergence learning approximation that I mentioned earlier. rr varies from 0 to softmax, h from 0 to totalFeatures, and m from 0 to numItems (480,000 in the case of the Netflix dataset - this factor alone), and it happens in every iteration. This part would benefit immensely from distributed data structures and the associated operations.

Another naive idea, of course, is that the predictions for the 2.8 million entries in the Netflix Prize qualifying set - or for any other recommender system of equal or larger size - would benefit immensely from being done in parallel. I guess this will become clear once the previous issue is figured out. So the second issue I'm faced with is:

2. What are the distributed data structures/operations/iterators in Mahout that would be perfect for these algorithm-related data structures and operations?

So, to summarize: the first issue is obtaining a fast iterator over user-indexed entries and movie-indexed entries (whether that should be a sparse matrix or a distributed row matrix is the doubt), and the second issue is how to go about parallelizing the above operations within the algorithm (once the user iterator, movie iterator, and format are cleared up) - operations which are terribly inefficient in a serial form, and one of the primary issues with any deep belief net.
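To make the second issue a bit more concrete: one way to restate that CDinc update (a sketch only, not what the diff currently does) is to make the per-user work explicit - each user contributes independent positive/negative statistics, those get summed across users (which is the natural map/reduce split), and only the final fold into CDinc and vishid touches the full weight tensor. The names CDpos, CDneg and usersInBatch are hypothetical, the division by the batch size is an assumption about how CDp/CDn are normalized, and the final weight step is assumed too:

    // Sketch only: CDpos/CDneg (summed over users) and usersInBatch are hypothetical;
    // CDinc, vishid and the constants mirror the ones used in the diff.
    static void applyCdUpdate(double[][][] CDinc, double[][][] vishid,
                              double[][][] CDpos, double[][][] CDneg,
                              int usersInBatch,
                              double momentum, double epsilonW, double weightCost) {
      for (int m = 0; m < vishid.length; m++) {              // m: items
        for (int rr = 0; rr < vishid[m].length; rr++) {      // rr: softmax ratings
          for (int h = 0; h < vishid[m][rr].length; h++) {   // h: hidden features
            double cd = (CDpos[m][rr][h] - CDneg[m][rr][h]) / usersInBatch;
            CDinc[m][rr][h] = momentum * CDinc[m][rr][h]
                + epsilonW * (cd - weightCost * vishid[m][rr][h]);
            vishid[m][rr][h] += CDinc[m][rr][h];             // assumed weight step
          }
        }
      }
    }

Only the accumulation of CDpos/CDneg needs the per-user data; the triple loop itself only touches the weights, so the expensive part to distribute is the per-user accumulation and the cross-user summation.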
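And on the prediction side, this is roughly the computation behind the "linear in the number of hidden units" statement - a self-contained sketch of the prediction pass from the paper, not code from the diff. The array names follow the diff's conventions (vishid[m][rr][h], visbiases, hidbiases), and ratings are assumed to be stored as 0-based indices:

    // Sketch of the prediction pass from the paper; not code from the diff.
    static double predictRating(int queryMovie,
                                int[] userMovies, int[] userRatings,  // one user's observed data
                                double[][][] vishid,                  // [movie][rating][hidden]
                                double[][] visbiases,                 // [movie][rating]
                                double[] hidbiases,
                                int softmax, int totalFeatures) {
      // 1. Hidden probabilities p(h = 1 | this user's observed softmax units):
      //    one pass over the user's movies per hidden unit.
      double[] hiddenProb = new double[totalFeatures];
      for (int h = 0; h < totalFeatures; h++) {
        double act = hidbiases[h];
        for (int n = 0; n < userMovies.length; n++) {
          act += vishid[userMovies[n]][userRatings[n]][h];
        }
        hiddenProb[h] = 1.0 / (1.0 + Math.exp(-act));
      }
      // 2. Reconstruct the softmax distribution over the query movie's ratings
      //    and return its expectation as the predicted rating.
      double[] score = new double[softmax];
      double norm = 0.0;
      for (int k = 0; k < softmax; k++) {
        double act = visbiases[queryMovie][k];
        for (int h = 0; h < totalFeatures; h++) {
          act += hiddenProb[h] * vishid[queryMovie][k][h];
        }
        score[k] = Math.exp(act);
        norm += score[k];
      }
      double expected = 0.0;
      for (int k = 0; k < softmax; k++) {
        expected += (k + 1) * (score[k] / norm);   // rating scale 1..softmax
      }
      return expected;
    }

Since this only reads the trained weights plus one user's ratings, each of the 2.8 million qualifying entries can be scored completely independently, which is what makes the prediction stage embarrassingly parallel once the data format question is settled.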
I really think that rashly doing something is going to cost me a lot of days of misspent time, which is why I'm trying to figure this out very clearly on the list, with your advice, before going ahead. Please do mention anything you find relevant - it would be immensely useful to me. Thanks, really, for the response so far - it's very encouraging.

Sisir
