Let me take the liberty of moving this onto the dev list since I think it will be most beneficial for you to interact with more than just me.
Jake can speak to everything needed to get to the solved-SVD output, but yes, you shouldn't have to do much beyond reusing the code. Is the best approach to just try it, see how it runs, and figure out the output? Yes. The prediction phase is 'just' the dotting you describe, but in the end this may be 10 MapReduces, so it's not trivial.

The next step is indeed to get into details and sketch out, conceptually, which MapReduces do what. I'm riffing below; the step after that is to write out the input/output of each mapper/reducer so we can all think through whether it works and is going to be efficient.

Part 1. Creating user / item vectors
1. Reuse the org.apache.mahout.cf.hadoop jobs to do this; easy.
2. Write a new job to compute averages from the user vectors. What are your inputs and outputs? The mapper takes a user ID mapped to a user Vector and outputs...?

Part 2. Compute the SVD
3. Run Lanczos, I'm guessing, on the user vectors.
4. Think about exactly what you need from the SVD. If you're doing what I think, then you need U x sqrt(S) and sqrt(S) x V', and you only need the nonzero bits of S and the corresponding bits of U and V. Are other jobs needed to create these outputs? (I don't know; Jake will!)

Part 3. Recommendation
5. Yes, the rest is just multiplying those matrices back together, but it's good to write out exactly how you're going to do that in M/R jobs: inputs and outputs.

Writing the M/Rs is pretty simple from there. I can quickly iterate with you and clear any roadblocks and/or help debug; that's the easy part. And yes, there is no relation to SVDRecommender here.

Hard to say what the bottleneck will be, but I think I/O is typically the killer here unless you take steps to limit the amount of data generated. After we have the design above nailed down, many people can comment on what little hooks and levers should be put in to trim the computation intelligently. Let's get the above going as a minimum before thinking about SVD++ or anything else like incremental updates.
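To make Part 1, step 2 concrete, here is a minimal plain-Java sketch of what the averaging mapper and reducer could compute. This is illustrative only, not existing Mahout code: the class and method names (`AverageSketch`, `userAverage`, `globalAverage`) are made up, and a real job would emit these values as Writables rather than return them.

```java
import java.util.Map;

// Sketch of the averaging job (Part 1, step 2). Names are illustrative,
// not existing Mahout classes.
// Map phase in:  (userID -> sparse user rating vector)
// Map phase out: (userID -> that user's average), plus a (sum, count)
//                pair under a single special key for the global average.
public class AverageSketch {

    // Mapper logic: from one user's sparse rating vector (itemID -> rating),
    // emit that user's average rating.
    public static double userAverage(Map<Integer, Double> userVector) {
        double sum = 0.0;
        for (double rating : userVector.values()) {
            sum += rating;
        }
        return userVector.isEmpty() ? 0.0 : sum / userVector.size();
    }

    // Reducer logic for the special key: combine per-user (sum, count)
    // pairs into the global average, so only two numbers per user
    // cross the network instead of every rating.
    public static double globalAverage(double[][] sumCountPairs) {
        double sum = 0.0;
        double count = 0.0;
        for (double[] pair : sumCountPairs) {
            sum += pair[0];
            count += pair[1];
        }
        return count == 0.0 ? 0.0 : sum / count;
    }
}
```

Emitting (sum, count) pairs rather than averages is what lets a combiner run safely, since sums and counts are associative while averages are not.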
On Tue, Jun 8, 2010 at 10:35 PM, Richard Simon Just <[email protected]> wrote:
> I've been feeling a bit stuck, but I think I was just trying to make things
> more complicated than they needed to be. So let me walk through how I see it
> going.
>
> 1) The conversion of the input, using mostly code that's already available
> 2) Calculating the average rating and the user and item averages -- in the
> worst case this requires 3 map/reduce phases, but I'm pretty sure it could be
> done in 2.
> 3) Use that with the DistributedLanczosSolver to solve the SVD
> 4) For prediction, use the map phase to calculate individual ratings,
> adding the average and biases to a dot product of the userFactorVector and
> the itemFactorVector. This keeps the amount of memory accessed at any one
> time to a minimum. The reducer would take the output and produce
> UserPrefVectors.
>
> Am I correct in saying that using the Lanczos algorithm completely replaces
> the training required in Mahout's non-distributed SVDRecommender?
>
> I realise I'm not at the stage of optimisation yet, but I was wondering which
> is better: fewer map/reduce cycles, or less file access?
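The per-rating arithmetic in step 4 of the quoted walkthrough can be sketched in a few lines of plain Java. This is a sketch under assumptions, not Mahout code: it assumes the factor vectors are the corresponding rows of U x sqrt(S) and sqrt(S) x V' already materialized as dense arrays, and the names (`PredictSketch`, `predict`) are hypothetical.

```java
// Sketch of the prediction done in the map phase (quoted step 4):
//   predicted rating = global average + user bias + item bias
//                      + dot(userFactorVector, itemFactorVector)
// where userFactorVector is a row of U*sqrt(S) and itemFactorVector is
// the matching column of sqrt(S)*V'. Names are illustrative only.
public class PredictSketch {

    // Dot product of the two factor vectors; both have length k, the
    // number of nonzero singular values kept from the SVD.
    public static double dot(double[] userFactors, double[] itemFactors) {
        double sum = 0.0;
        for (int k = 0; k < userFactors.length; k++) {
            sum += userFactors[k] * itemFactors[k];
        }
        return sum;
    }

    // One predicted rating: baseline (average plus biases) plus the
    // low-rank interaction term.
    public static double predict(double globalAverage, double userBias,
                                 double itemBias,
                                 double[] userFactors, double[] itemFactors) {
        return globalAverage + userBias + itemBias
                + dot(userFactors, itemFactors);
    }
}
```

Since each prediction touches only two length-k arrays and three scalars, the mapper's working set stays small, which is the memory property the quoted message is after.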
