On Tue, Jun 8, 2010 at 3:20 PM, Sean Owen <[email protected]> wrote:

> Let me take the liberty of moving this onto the dev list since I think
> it will be most beneficial for you to interact with more than just me.
>
> Jake can speak to everything needed to get to the solved-SVD output,
> but yes you shouldn't have to do much but reuse the code. Jake, is the
> best approach to Just Try It? See how it runs and figure out the
> output?
Certainly "Just Try It" works - as long as you keep track of what exactly
it is you're doing, or else you'll get confused as to why your predictions
are negative or whatever. :)  But you address "how" a little below, and
I'll jump in-line down there.

> Yes the prediction phase is 'just' the dotting you describe, but in
> the end this may be 10 map reduces, so it's not trivial. Next step is
> to indeed get into details and sketch out, conceptually, what
> mapreduces do what. I'm riffing below; the next step is to write out
> the input/output of each mapper/reducer so we can all think through
> whether it works and is going to be efficient.

Prediction, once you have the SVD, should be only one, or at most two,
mapreduces.

> Part 1. Creating user / item vectors
> 1. Reuse org.apache.mahout.cf.hadoop jobs to do that, easy

I've been out of the loop - the cf.hadoop jobs produce user vectors now?
What is the format? SequenceFile<LongWritable,VectorWritable>? If it's
this, or SequenceFile<IntWritable,VectorWritable>, then you're in good
shape for doing DistributedLanczosSolver already, that's great (and
incidentally, I bet we could use those jobs in other places: the format
of a "preference" file is really just a particular case of the "matrix
market <http://math.nist.gov/MatrixMarket/formats.html>" format, and we
should probably only write a matrix market reader once in this project. :)

> 2. Write a new one to compute averages from the user vectors -- what
> are your inputs and outputs? Mapper takes user ID mapped to user
> Vector, outputs...?

This is for subtracting off the biases? Be careful not to turn your user
vectors into dense vectors in this step!

> Part 2. Compute the SVD
> 3. Run Lanczos, I'm guessing, on user vectors.

Sounds right at this point.
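To make the "don't densify" point concrete, here's a tiny standalone sketch of mean-centering that only ever touches a user's nonzero entries - the class and method names are made up for illustration, this is not a Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch - not a Mahout class. Mean-centers a sparse user
// vector (itemId -> rating) by iterating only its nonzero entries, so the
// vector never becomes dense: unrated items stay implicit zeros.
class SparseMeanCenter {

  static Map<Integer, Double> center(Map<Integer, Double> ratings) {
    double sum = 0.0;
    for (double r : ratings.values()) {
      sum += r;
    }
    double avg = sum / ratings.size();  // average over rated items only
    Map<Integer, Double> centered = new HashMap<>();
    for (Map.Entry<Integer, Double> e : ratings.entrySet()) {
      centered.put(e.getKey(), e.getValue() - avg);
    }
    return centered;
  }

  public static void main(String[] args) {
    Map<Integer, Double> r = new HashMap<>();
    r.put(1, 4.0);
    r.put(7, 2.0);
    System.out.println(center(r));  // each rated entry shifted by the user's average
  }
}
```

The point is just that the map over entries never materializes the zeros, which is exactly what a mapper over VectorWritables should do too.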
One important point on this: DistributedLanczosSolver produces the left
singular vectors and the singular values, but they can be "dirty" - have
some duplicates, have some which are not converged quite enough, not
orthogonal enough, etc. Thus you should run EigenVerificationJob on the
output of that job, and the output of *this* will be "clean" (based on
parameters you set on the job - convergence criteria, orthogonality,
minimum singular value allowed, etc).

> 4? Think about exactly what you need from the SVD -- if you're doing
> what I think then you need U x sqrt(S) and sqrt(S) x V'. And you only
> need the nonzero bits of S and corresponding bits of U and V. Are
> other jobs needed to create these outputs? (I don't know, Jake will!)

EigenVerificationJob will output V, and S. If you want U, then you can get
that by computing userVectors.times(V) and then scaling each column by
1/sigma_i, essentially (A = U S V' implies A V = U S, so dividing out the
singular values recovers U). This can be done in one map-reduce pass (or
two if the transposes don't line up the right way), by modelling after
MatrixMultiplyJob.

> Part 3. Recommendation
> 5. Yes the rest is just multiplying those matrices back together but
> good to write out exactly how you're going to do that in M/R jobs.
> Inputs and outputs.

So actually, I should correct myself - if you really want recommendations,
you want entries in userVectors (call it 'A') which weren't there before:

  A = U S V'

You have A, and V. Then just take B = A V V' as your predictions:
essentially you project A down into V-space (the space of "topics"), then
lift back up with V', and you'll be in the space you started with - user
ratings. That make sense?

My personal feel is that performance optimization goes in these steps:

1) make sure you're maximally making use of the sequential file access
pattern encouraged by Hadoop to properly distribute each step,
2) minimize the number of M/R full passes,

3) minimize the size of data output and/or the amount of data which has
to travel all around (i.e. through the reduce shuffle), and then

4) anything else.

  -jake
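P.S. For concreteness, here's the B = A V V' step for a single user row as plain, non-distributed Java - just the math, not the M/R job, and the names are made up:

```java
// Hypothetical sketch - plain arrays, not the distributed version. Computes
// one row of B = A V V': project the user's rating row a into "topic" space
// (a . V), then lift back up with V'.
class SvdProjectSketch {

  // a: one user's ratings over n items; v: n x k right singular vectors.
  static double[] predictRow(double[] a, double[][] v) {
    int n = a.length;
    int k = v[0].length;
    double[] topics = new double[k];  // (1 x k) = a . V
    for (int j = 0; j < k; j++) {
      for (int i = 0; i < n; i++) {
        topics[j] += a[i] * v[i][j];
      }
    }
    double[] b = new double[n];       // (1 x n) = (a V) . V'
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        b[i] += topics[j] * v[i][j];
      }
    }
    return b;
  }

  public static void main(String[] args) {
    // One "topic" that is exactly item 0: the projection keeps the item-0
    // rating and zeroes out everything outside the topic space.
    double[][] v = {{1.0}, {0.0}};
    double[] b = predictRow(new double[] {3.0, 5.0}, v);
    System.out.println(b[0] + " " + b[1]);
  }
}
```

Each user's row is independent, which is why a single map-side pass (with V in memory or in the distributed cache) suffices.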
