On Tue, Jun 8, 2010 at 3:20 PM, Sean Owen <[email protected]> wrote:

> Let me take the liberty of moving this onto the dev list since I think
> it will be most beneficial for you to interact with more than just me.
>
> Jake can speak to everything needed to get to the solved-SVD output,
> but yes you shouldn't have to do much but reuse the code. Jake, is the
> best approach to Just Try It? See how it runs and figure out the
> output?
Certainly "Just Try It" works - as long as you keep track of what exactly
it is you're doing, or else you'll get confused as to why your predictions
are negative or whatever. :)  But you address "how" a little below, and
I'll jump in-line down there.

> Yes the prediction phase is 'just' the dotting you describe, but in
> the end this may be 10 map reduces, so it's not trivial. Next step is
> to indeed get into details and sketch out, conceptually, what
> mapreduces do what. I'm riffing below; the next step is to write out
> the input/output of each mapper/reducer so we can all think through
> whether it works and is going to be efficient.

Prediction, once you have the SVD, should be only one, or at most two,
mapreduces.

> Part 1. Creating user / item vectors
> 1. Reuse org.apache.mahout.cf.hadoop jobs to do that, easy

I've been out of the loop - the cf.hadoop jobs produce user vectors now?
What is the format? SequenceFile<LongWritable,VectorWritable>? If it's
this, or SequenceFile<IntWritable,VectorWritable>, then you're in good
shape for doing DistributedLanczosSolver already, that's great (and
incidentally, I bet we could use those jobs in other places: the format
of a "preference" file is really just a particular case of the "matrix
market <http://math.nist.gov/MatrixMarket/formats.html>" format, and we
should probably only write a matrix market reader once in this project. :)

> 2. Write a new one to compute averages from the user vectors -- what
> are your inputs and outputs? Mapper takes user ID mapped to user
> Vector, outputs...?

This is for subtracting off the biases? Be careful not to turn your user
vectors into dense vectors in this step!

> Part 2. Compute the SVD
> 3. Run Lanczos, I'm guessing, on user vectors.

Sounds right at this point.
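To make the "don't densify" point concrete, here's a tiny standalone sketch of mean-centering that only ever touches a user's nonzero entries - the class and method names are made up for illustration, this is not a Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch - not a Mahout class. Mean-centers a sparse user
// vector (itemId -> rating) by iterating only its nonzero entries, so the
// vector never becomes dense: unrated items stay implicit zeros.
class SparseMeanCenter {

  static Map<Integer, Double> center(Map<Integer, Double> ratings) {
    double sum = 0.0;
    for (double r : ratings.values()) {
      sum += r;
    }
    double avg = sum / ratings.size();  // average over rated items only
    Map<Integer, Double> centered = new HashMap<>();
    for (Map.Entry<Integer, Double> e : ratings.entrySet()) {
      centered.put(e.getKey(), e.getValue() - avg);
    }
    return centered;
  }

  public static void main(String[] args) {
    Map<Integer, Double> r = new HashMap<>();
    r.put(1, 4.0);
    r.put(7, 2.0);
    System.out.println(center(r));  // each rated entry shifted by the user's average
  }
}
```

The point is just that the map over entries never materializes the zeros, which is exactly what a mapper over VectorWritables should do too.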
One important point on this: DistributedLanczosSolver produces the left
singular vectors and the singular values, but they can be "dirty" - have
some duplicates, have some which are not converged quite enough, not
orthogonal enough, etc. Thus you should run EigenVerificationJob on the
output of that job, and the output of *this* will be "clean" (based on
parameters you set on the job - convergence criteria, orthogonality,
minimum singular value allowed, etc).

> 4? Think about exactly what you need from the SVD -- if you're doing
> what I think then you need U x sqrt(S) and sqrt(S) x V'. And you only
> need the nonzero bits of S and corresponding bits of U and V. Are
> other jobs needed to create these outputs? (I don't know, Jake will!)

EigenVerificationJob will output V, and S. If you want U, then you can get
that by computing userVectors.times(V) and then scaling each column by
1/sigma_i, essentially (A = U S V' implies A V = U S, so dividing out the
singular values recovers U). This can be done in one map-reduce pass (or
two if the transposes don't line up the right way), by modelling after
MatrixMultiplyJob.

> Part 3. Recommendation
> 5. Yes the rest is just multiplying those matrices back together but
> good to write out exactly how you're going to do that in M/R jobs.
> Inputs and outputs.

So actually, I should correct myself - if you really want recommendations,
you want entries in userVectors (call it 'A') which weren't there before:

  A = U S V'

You have A, and V. Then just take B = A V V' as your predictions:
essentially you project A down into V-space (the space of "topics"), then
lift back up with V', and you'll be in the space you started with - user
ratings. That make sense?

My personal feel is that performance optimization goes in these steps:

1) make sure you're maximally making use of the sequential file access
pattern encouraged by Hadoop to properly distribute each step,
2) minimize the number of M/R full passes,

3) minimize the size of data output and/or the amount of data which has
to travel all around (i.e. through the reduce shuffle), and then

4) anything else.

  -jake
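P.S. For concreteness, here's the B = A V V' step for a single user row as plain, non-distributed Java - just the math, not the M/R job, and the names are made up:

```java
// Hypothetical sketch - plain arrays, not the distributed version. Computes
// one row of B = A V V': project the user's rating row a into "topic" space
// (a . V), then lift back up with V'.
class SvdProjectSketch {

  // a: one user's ratings over n items; v: n x k right singular vectors.
  static double[] predictRow(double[] a, double[][] v) {
    int n = a.length;
    int k = v[0].length;
    double[] topics = new double[k];  // (1 x k) = a . V
    for (int j = 0; j < k; j++) {
      for (int i = 0; i < n; i++) {
        topics[j] += a[i] * v[i][j];
      }
    }
    double[] b = new double[n];       // (1 x n) = (a V) . V'
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        b[i] += topics[j] * v[i][j];
      }
    }
    return b;
  }

  public static void main(String[] args) {
    // One "topic" that is exactly item 0: the projection keeps the item-0
    // rating and zeroes out everything outside the topic space.
    double[][] v = {{1.0}, {0.0}};
    double[] b = predictRow(new double[] {3.0, 5.0}, v);
    System.out.println(b[0] + " " + b[1]);
  }
}
```

Each user's row is independent, which is why a single map-side pass (with V in memory or in the distributed cache) suffices.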
