Let me take the liberty of moving this onto the dev list since I think it will be most beneficial for you to interact with more than just me.
Jake can speak to everything needed to get to the solved-SVD output, but yes, you shouldn't have to do much beyond reusing the code. Is the best approach to just try it, see how it runs, and figure out the output? Yes. The prediction phase is 'just' the dotting you describe, but in the end this may be 10 MapReduces, so it's not trivial.

The next step is indeed to get into details and sketch out, conceptually, which MapReduces do what. I'm riffing below; the step after that is to write out the input/output of each mapper/reducer so we can all think through whether it works and is going to be efficient.

Part 1. Creating user / item vectors
1. Reuse the org.apache.mahout.cf.hadoop jobs to do this; easy.
2. Write a new job to compute averages from the user vectors. What are your inputs and outputs? The mapper takes a user ID mapped to a user Vector and outputs...?

Part 2. Compute the SVD
3. Run Lanczos, I'm guessing, on the user vectors.
4. Think about exactly what you need from the SVD. If you're doing what I think, then you need U x sqrt(S) and sqrt(S) x V', and you only need the nonzero bits of S and the corresponding bits of U and V. Are other jobs needed to create these outputs? (I don't know; Jake will!)

Part 3. Recommendation
5. Yes, the rest is just multiplying those matrices back together, but it's good to write out exactly how you're going to do that in M/R jobs: inputs and outputs.

Writing the M/Rs is pretty simple from there. I can quickly iterate with you and clear any roadblocks and/or help debug; that's the easy part. And yes, there is no relation to SVDRecommender here.

Hard to say what the bottleneck will be, but I think I/O is typically the killer here unless you take steps to limit the amount of data generated. After we have the design above nailed down, many people can comment on what little hooks and levers should be put in to trim the computation intelligently. Let's get the above going as a minimum before thinking about SVD++ or anything else like incremental updates.
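To make Part 1, step 2 concrete, here is a minimal plain-Java sketch of what the averaging mapper and reducer could compute. This is illustrative only, not existing Mahout code: the class and method names (`AverageSketch`, `userAverage`, `globalAverage`) are made up, and a real job would emit these values as Writables rather than return them.

```java
import java.util.Map;

// Sketch of the averaging job (Part 1, step 2). Names are illustrative,
// not existing Mahout classes.
// Map phase in:  (userID -> sparse user rating vector)
// Map phase out: (userID -> that user's average), plus a (sum, count)
//                pair under a single special key for the global average.
public class AverageSketch {

    // Mapper logic: from one user's sparse rating vector (itemID -> rating),
    // emit that user's average rating.
    public static double userAverage(Map<Integer, Double> userVector) {
        double sum = 0.0;
        for (double rating : userVector.values()) {
            sum += rating;
        }
        return userVector.isEmpty() ? 0.0 : sum / userVector.size();
    }

    // Reducer logic for the special key: combine per-user (sum, count)
    // pairs into the global average, so only two numbers per user
    // cross the network instead of every rating.
    public static double globalAverage(double[][] sumCountPairs) {
        double sum = 0.0;
        double count = 0.0;
        for (double[] pair : sumCountPairs) {
            sum += pair[0];
            count += pair[1];
        }
        return count == 0.0 ? 0.0 : sum / count;
    }
}
```

Emitting (sum, count) pairs rather than averages is what lets a combiner run safely, since sums and counts are associative while averages are not.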
On Tue, Jun 8, 2010 at 10:35 PM, Richard Simon Just <[email protected]> wrote:
> I've been feeling a bit stuck, but I think I was just trying to make things
> more complicated than they needed to be. So let me walk through how I see it
> going.
>
> 1) The conversion of the input, using mostly code that's already available
> 2) Calculating the average rating and the user and item averages -- in the
> worst case this requires 3 map/reduce phases, but I'm pretty sure it could be
> done in 2.
> 3) Use that with the DistributedLanczosSolver to solve the SVD
> 4) For prediction, use the map phase to calculate individual ratings,
> adding the average and biases to a dot product of the userFactorVector and
> the itemFactorVector. This keeps the amount of memory accessed at any one
> time to a minimum. The reducer would take the output and produce
> UserPrefVectors.
>
> Am I correct in saying that using the Lanczos algorithm completely replaces
> the training required in Mahout's non-distributed SVDRecommender?
>
> I realise I'm not at the stage of optimisation yet, but I was wondering which
> is better: fewer map/reduce cycles, or less file access?
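The per-rating arithmetic in step 4 of the quoted walkthrough can be sketched in a few lines of plain Java. This is a sketch under assumptions, not Mahout code: it assumes the factor vectors are the corresponding rows of U x sqrt(S) and sqrt(S) x V' already materialized as dense arrays, and the names (`PredictSketch`, `predict`) are hypothetical.

```java
// Sketch of the prediction done in the map phase (quoted step 4):
//   predicted rating = global average + user bias + item bias
//                      + dot(userFactorVector, itemFactorVector)
// where userFactorVector is a row of U*sqrt(S) and itemFactorVector is
// the matching column of sqrt(S)*V'. Names are illustrative only.
public class PredictSketch {

    // Dot product of the two factor vectors; both have length k, the
    // number of nonzero singular values kept from the SVD.
    public static double dot(double[] userFactors, double[] itemFactors) {
        double sum = 0.0;
        for (int k = 0; k < userFactors.length; k++) {
            sum += userFactors[k] * itemFactors[k];
        }
        return sum;
    }

    // One predicted rating: baseline (average plus biases) plus the
    // low-rank interaction term.
    public static double predict(double globalAverage, double userBias,
                                 double itemBias,
                                 double[] userFactors, double[] itemFactors) {
        return globalAverage + userBias + itemBias
                + dot(userFactors, itemFactors);
    }
}
```

Since each prediction touches only two length-k arrays and three scalars, the mapper's working set stays small, which is the memory property the quoted message is after.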
