On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <[email protected]> wrote:
> Sure, I'll discuss a bit of what I wanted to do and what operations are
> required.
>
> The issue I had when I used naive Java to load up the Netflix dataset was
> that it simply didn't fit in memory, even on a machine with 24GB RAM,
> probably due to the per-object overhead that any data type carries in Java.
> There's no question that the source data has to live in some distributed
> fashion, as with any other algorithm, in Mahout.
>
This memory usage is not inherent in Java itself, merely in the use of the
highly generic collections that come for free in Java (although the size you
quote is a bit high).
If you switch to using the sparse matrix structures from Mahout, you should
be able to do much better.
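For instance (just a sketch against the org.apache.mahout.math classes as I
remember them; check the exact names against the Mahout version you build),
one user's ratings fit naturally into a sparse vector indexed by movie id:

import java.util.Iterator;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseRatingsExample {
  public static void main(String[] args) {
    // One sparse vector per user, indexed by movie id, holding the rating as
    // the value. Only non-zero entries are stored, so there is no boxed
    // Integer/Double overhead per rating.
    Vector userRatings = new RandomAccessSparseVector(17770);  // 17,770 movies in Netflix
    userRatings.set(3, 4.0);     // movie 3, rating 4
    userRatings.set(17, 5.0);    // movie 17, rating 5

    Iterator<Vector.Element> it = userRatings.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      System.out.println("movie " + e.index() + " -> rating " + e.get());
    }
  }
}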
Keep in mind, though, that the Netflix data set, at 100 million ratings, is
only a moderately sized dataset. At Veoh, for instance, our recommendation
engine had to handle 50 billion user x item observations, and we never did
much more than barely crack the top 100 web sites. There are lots of places
that handle far more data.
> The general access pattern to the dataset within any algo would be to find
> the lists of (user, rating) for each movie and to find the lists of
> (movie, rating) for each user in a fast way. In C++, this design
> requirement can be satisfied using an in-memory approach.
>
This pattern of access is essentially the same as a matrix multiplication.
If A is the user x movie matrix, then your computation has the same
computational shape as the multiplication A'A.
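To make that concrete, here is a toy, dense illustration of the loop shape
(my own sketch, not anything in Mahout, and a real pass would of course use
sparse rows): walking each user's rating list and pairing the entries up is
exactly the work of forming A'A.

// Toy illustration of the loop shape only. a[user][movie] holds the rating,
// with 0 meaning "unrated". Pairing up the entries of each user's rating
// list accumulates the item x item product A'A.
static double[][] itemItemProduct(double[][] a) {
  int users = a.length, movies = a[0].length;
  double[][] ata = new double[movies][movies];
  for (int u = 0; u < users; u++) {
    for (int i = 0; i < movies; i++) {
      if (a[u][i] == 0) continue;          // skip unrated movies
      for (int j = 0; j < movies; j++) {
        if (a[u][j] == 0) continue;
        ata[i][j] += a[u][i] * a[u][j];    // (A'A)[i][j] summed over users
      }
    }
  }
  return ata;
}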
> One solution is that we first design a bit access pattern such that we fit
> (user, rating) or (movie, rating) in 32 bits (actually even the date can
> fit, but that depends on whether we have a temporal model to use it :) ).
I don't think that this is a good design decision at all. Much better to
keep your structures general. Many large applications will have more than
20,000 items. To use Veoh as an example again, we had tens of millions of
users and about 10 million videos.
> So the first one would be like:
>
> User 1:
> Movie 3, Rating 4
> Movie 17, Rating 5
This is just a sparse matrix. You should use a pre-designed data structure
for this.
> and vice versa. With dates if necessary.
>
Adding dates is easily done with a second matrix. Make sure you notice that
there are two kinds of sparse matrix. You want the sequential access kind.
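Concretely, and again only as a sketch (in Mahout the distinction shows up
in the sparse vector classes used for the rows, RandomAccessSparseVector
versus SequentialAccessSparseVector), you would build with the random access
kind and convert once before scanning:

// Continuing the earlier sketch: build with hashed random access while
// parsing the data, then convert once so full scans run in index order.
Vector building = new RandomAccessSparseVector(17770);
building.set(3, 4.0);
building.set(17, 5.0);
Vector scanning = new SequentialAccessSparseVector(building);  // org.apache.mahout.math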
You might look to see if you can reduce your storage requirements by
storing the equivalent of A'A instead of A' and A as you are suggesting.
> In the current context, I had the following problems with the above plan:
> 1. *Read the netflix-dataset.csv file of (user,movie,rating,date) and write
> it as ? in HDFS.* Should ? be a sparse matrix or a distributed row matrix
> or something else?
I think that this is just a bit premature. Do you have a document that
describes what you want to do at an algorithmic level? You have given some
hints about this, but picking a data structure should be done after
describing the algorithm. Especially when you are talking about
parallelization, you need to think algorithmically and not get too low level
too soon.
> C++ format? The guarantee that the format here should provide, at the
> minimum, is fast access to each user's or movie's entries throughout the
> dataset. I'm not clear on what to choose for this purpose from those within
> Mahout.
>
You keep talking in terms of essentially random access to data structures.
To effectively parallelize this algorithm you will need to move beyond that
and describe how you can smoothly scan the data in parallel. This requires
that you not think about the algorithm in terms of indexing this array or
twiddling that bit, but move up a level to think about what you really
intend.
> A few examples of the operations -
>
> for (i = 0; i < softmax; i++) {
>     visbiases[j][i] = Math.log(((double) moviercount[j*softmax+i]) / ((double) mtot));
> }
>
> ...
> Summation of a particular length of an array is something that is done
> extensively again and again in many places.
>
> Here's another example -
> CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn) -
> weightCost * vishid[m][rr][h]);
>
These loops need to be viewed in a larger context to make an intelligent
recommendation. To take a very simple example, many people would write a
dense matrix multiplication this way (this is pseudo-code, don't try to
compile it):
c = new DenseMatrix(a.rows(), b.columns());
for (int i = 0; i < a.rows(); i++) {
  for (int j = 0; j < b.columns(); j++) {
    double sum = 0;
    for (int k = 0; k < a.columns(); k++) {   // k runs over the shared inner dimension
      sum += a.get(i, k) * b.get(k, j);
    }
    c.put(i, j, sum);
  }
}
This looks fine, but is typically nearly an order of magnitude slower than
the optimal code. If you try to "optimize" this code as it stands, you are
very unlikely to succeed. If, on the other hand, you back up a step and
think about what matrix multiplication really is, and what it costs to do,
you will quickly realize that the real problem is that there are O(n^3)
arithmetic operations against O(n^2) data elements, but that this set of
loops is doing O(n^3) memory fetches. Since memory is much slower than
arithmetic on modern processors, this creates a memory bottleneck. What you
need to do, then, is to re-use elements many times to avoid reading them
from main memory. To do this, you need to do a block decomposition of the
matrix multiply, and you may need two levels of block decomposition in order
to reflect the L1 cache and the number of available registers.
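As a rough sketch of what one level of that blocking might look like on
plain arrays (the tile size of 64 below is an arbitrary placeholder, not a
tuned value, and a real version would add a second level for the registers):

// One level of block decomposition for c = a * b on dense double[][] arrays.
// Each tile of a and b is reused many times while it is still hot in cache,
// so main-memory traffic drops from O(n^3) toward O(n^3 / block).
static double[][] blockedMultiply(double[][] a, double[][] b) {
  int n = a.length, m = b[0].length, p = b.length;
  int block = 64;  // placeholder tile size; tune to the L1 cache of the target CPU
  double[][] c = new double[n][m];
  for (int i0 = 0; i0 < n; i0 += block) {
    for (int k0 = 0; k0 < p; k0 += block) {
      for (int j0 = 0; j0 < m; j0 += block) {
        int iMax = Math.min(i0 + block, n);
        int kMax = Math.min(k0 + block, p);
        int jMax = Math.min(j0 + block, m);
        for (int i = i0; i < iMax; i++) {
          for (int k = k0; k < kMax; k++) {
            double aik = a[i][k];
            for (int j = j0; j < jMax; j++) {
              c[i][j] += aik * b[k][j];
            }
          }
        }
      }
    }
  }
  return c;
}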
The lesson here is not just to re-use elements. That is a valid
micro-lesson, but the big lesson is to back up to the mathematics before
trying to write the code, especially when moving to a new architecture. A
medium-sized lesson is to use abstract operations for as big a piece of what
you are doing as possible so that you make use of somebody else's
optimization efforts.
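For example, assuming the org.apache.mahout.math.Matrix interface, the whole
hand-written product above collapses to a single call, and whatever
optimization the library does comes along for free:

Matrix c = a.times(b);   // a and b are org.apache.mahout.math.Matrix instances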
...
> So in short, the second issue I'm faced with is:
>
> 2. What are the distributed data structures/operations/iterators that would
> be perfect for these algorithm-related data structures/operations in
> Mahout?
>
Write down your algorithm first. Don't just point to a paper that
references another paper.
> So, in short, the first issue is with obtaining a fast iterator across
> user-indexed entries and movie-indexed entries (whether it should be a
> sparse matrix or a distributed row matrix is the question), and the second
> issue is how I go about parallelizing the above operations within the
> algorithm (once I've got the user iterator and movie iterator, and the
> format cleared up), which are terribly inefficient in a serial form (and
> one of the primary issues with any deep belief net).
>
Actually, these are probably something like the 4th and 5th issues that you
face. First write your algorithm down using high level notation (NOT
element level pseudo-code). Then think about how you can decompose that in
a map-reduce framework. Then think about what matrix primitives can help
with the decomposed form. Only then should you start to talk about data
structures.
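Just to illustrate what the second of those steps might look like, here is a
hypothetical sketch against the plain Hadoop API (not an existing Mahout
job; the class and field names are made up): regrouping the raw ratings by
movie is a single map step.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical first pass: turn raw "user,movie,rating,date" lines into
// (movie, user:rating) pairs so that each reducer sees all ratings of one
// movie. This is the A' half of the access pattern discussed above.
public class RatingsByMovieMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  private final IntWritable movie = new IntWritable();
  private final Text userAndRating = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");   // user,movie,rating,date
    movie.set(Integer.parseInt(fields[1]));
    userAndRating.set(fields[0] + ":" + fields[2]);
    ctx.write(movie, userAndRating);
  }
}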
> I really think that rashly doing something is going to cause me a lot of
> days of misspent time,
You are correct here, but not in the way that you think.