On Tue, Jun 8, 2010 at 4:03 PM, Sean Owen <[email protected]> wrote:
> On Wed, Jun 9, 2010 at 12:47 AM, Jake Mannix <[email protected]>
> wrote:
> > I've been out of the loop - the cf.hadoop jobs produce user vectors
> > now? What is the format? SequenceFile<LongWritable,VectorWritable>?
>
> Yes exactly so.
>
Too bad it's not IntWritable,VectorWritable, because then we'd actually
be able to transpose() properly - our Vector implementation assumes int
indices. :\
You can still do the decomposition on this, because the transpose isn't
needed. But if one of the later steps needs it...
> > This is for subtracting off the biases? Be careful not to turn your
> > user vectors into dense vectors in this step!
>
> Yeah is this step even needed? Ted indicated he also didn't exactly
> see the motivation.
> Is the intuition that the predictions turn out to have mean 0 so the
> real average rating has to be added back in? Well that certainly is
> the implication but I don't see the intuitive reason this is so. (Not
> that that's required.)
>
Yeah, I've never actually done this myself (subtracting off the biases),
and things seem to work out OK, but I don't fully understand the theory
of *why* it doesn't fail horribly.
> > A = U S V'
> >
> > you have A, and V. Then just take B = A V V' as your
> > predictions: essentially you project A down into V-space
> > (the space of "topics"), the lift back up with V', and you'll
> > be in the space you started with - user ratings.
>
> After retaining only k singular values we really have Uk Sk Vk' whose
> product is Ak, an approximation of A. And yes that matches my
> understanding that the "delta" between Ak and A gives you the
> recommendations - new non-zero values.
>
Note: DistributedLanczosSolver only gets you V and S, not U, btw.
You can of course calculate U, but as it turns out, for this you
don't need it:
> Did I read right that the process outputs all three of those? So it's
> just multiplying them together right?
>
> I guess I'm not following B = A V V' ... isn't V V' = I so what is that
> doing?
>
V' * V = I, but V * V' is not, unless k = M (numItems). Instead, V * V'
acts as a "projection" matrix: taking vectors from the space of
user-preferences, and projecting them, keeping the same basis, in the
same space, onto the subspace spanned by the columns of V.
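To make that concrete, here's a tiny standalone sketch in plain Java
(double arrays only, no Mahout types; the 3x2 matrix V is made up for
illustration): with M = 3 items and rank k = 2, V'V is the 2x2 identity,
but VV' is a 3x3 projection -- idempotent, and not the identity.

```java
public class ProjectionDemo {

  // C = A (m x n) times B (n x p)
  static double[][] times(double[][] a, double[][] b) {
    int m = a.length, n = b.length, p = b[0].length;
    double[][] c = new double[m][p];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < p; j++)
        for (int x = 0; x < n; x++)
          c[i][j] += a[i][x] * b[x][j];
    return c;
  }

  static double[][] transpose(double[][] a) {
    double[][] t = new double[a[0].length][a.length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[0].length; j++)
        t[j][i] = a[i][j];
    return t;
  }

  public static void main(String[] args) {
    double s = 1.0 / Math.sqrt(2.0);
    // V is 3 x 2 with orthonormal columns (k = 2 < M = 3)
    double[][] v = { { s, 0 }, { -s, 0 }, { 0, 1 } };
    double[][] vtv = times(transpose(v), v);  // 2x2: the identity
    double[][] vvt = times(v, transpose(v));  // 3x3: NOT the identity
    double[][] vvt2 = times(vvt, vvt);        // but idempotent: P*P = P
    System.out.println("V'V[0][0] = " + vtv[0][0]);  // 1.0
    System.out.println("VV'[0][0] = " + vvt[0][0]);  // 0.5, not 1
    System.out.println("idempotent: "
        + (Math.abs(vvt2[0][0] - vvt[0][0]) < 1e-12));
  }
}
```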
To see it's the same thing as what you were describing:
1) A ~=~ B = U S V'
2) A*V = U S V' V = U S
3) A*V*S^-1 = U
That's how you get U from A, V, and S. Now reconstruct A out of
U, V and S, using line 3) above to substitute in U * S * V' :
U * S * V' = (A * V * S^-1) * S * V' = A * V * V'
Which means you can avoid ever really calculating the full U
matrix at all.
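The substitution above can be checked numerically on a toy example
(everything here is made up: 2x2 rotation matrices for U and V,
singular values 3 and 1) -- build A = U S V' by hand, keep only the
top singular pair (k = 1), and verify both that (A v1) / sigma1
recovers u1, and that A V1 V1' equals U1 S1 V1' without ever
materializing U.

```java
public class SvdIdentityDemo {

  static double[][] times(double[][] a, double[][] b) {
    int m = a.length, n = b.length, p = b[0].length;
    double[][] c = new double[m][p];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < p; j++)
        for (int x = 0; x < n; x++)
          c[i][j] += a[i][x] * b[x][j];
    return c;
  }

  static double[][] transpose(double[][] a) {
    double[][] t = new double[a[0].length][a.length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[0].length; j++)
        t[j][i] = a[i][j];
    return t;
  }

  public static void main(String[] args) {
    double c = Math.cos(0.3), s = Math.sin(0.3);
    double[][] u = { { c, -s }, { s, c } };   // orthogonal (a rotation)
    double[][] v = { { c, -s }, { s, c } };   // ditto
    double[][] sig = { { 3, 0 }, { 0, 1 } };  // singular values 3, 1
    double[][] a = times(times(u, sig), transpose(v));

    // Truncate to k = 1: first columns of V and U only
    double[][] v1 = { { v[0][0] }, { v[1][0] } };
    double[][] u1 = { { u[0][0] }, { u[1][0] } };

    // Line 3): A * V * S^-1 = U, column by column: (A v1) / sigma1 == u1
    double[][] av1 = times(a, v1);
    System.out.println("u1 recovered: "
        + (Math.abs(av1[0][0] / 3.0 - u1[0][0]) < 1e-12));

    // A * V1 * V1' == U1 * S1 * V1': the rank-1 reconstruction,
    // computed without U
    double[][] left = times(av1, transpose(v1));
    double[][] right =
        times(times(u1, new double[][] { { 3.0 } }), transpose(v1));
    System.out.println("A V V' == U S V': "
        + (Math.abs(left[0][0] - right[0][0]) < 1e-12
            && Math.abs(left[1][1] - right[1][1]) < 1e-12));
  }
}
```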
In fact, you can compute A * V * V' in one map-reduce pass
(assuming you can hold V in memory, which might not be
possible, depending on the rank and number of items) :
for (Vector row : A) {
  // V.timesSquared(row) computes V' * (V * row): the row projected
  // onto the span of the singular vectors, i.e. one row of A * V * V'
  outputCollector.collect(rowId, new VectorWritable(V.timesSquared(row)));
}
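The per-row work in that map can be sketched in plain Java (the helper
name projectRow is mine, not Mahout's): with V stored as k rows of
length numItems, a user row a maps to ahat = V' * (V * a), which is the
corresponding row of A * V * V' -- two small loops, one pass over A.

```java
public class ProjectRowDemo {

  // ahat = V' * (V * a), with V stored row-wise as k rows of length m
  static double[] projectRow(double[][] vRows, double[] a) {
    int k = vRows.length, m = a.length;
    double[] p = new double[k];        // p = V * a : k "topic" weights
    for (int t = 0; t < k; t++)
      for (int j = 0; j < m; j++)
        p[t] += vRows[t][j] * a[j];
    double[] ahat = new double[m];     // ahat = V' * p : back to items
    for (int j = 0; j < m; j++)
      for (int t = 0; t < k; t++)
        ahat[j] += vRows[t][j] * p[t];
    return ahat;
  }

  public static void main(String[] args) {
    double s = 1.0 / Math.sqrt(2.0);
    // Two orthonormal "singular vector" rows over 3 items (made-up data)
    double[][] vRows = { { s, -s, 0 }, { 0, 0, 1 } };
    double[] a = { 1, 0, 2 };          // one user's ratings row
    double[] once = projectRow(vRows, a);
    double[] twice = projectRow(vRows, once);
    // Projection is idempotent: projecting again changes nothing
    System.out.println(Math.abs(once[0] - twice[0]) < 1e-12
        && Math.abs(once[1] - twice[1]) < 1e-12);
  }
}
```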
-jake