On Fri, Dec 11, 2009 at 10:23 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> Where are these hooks you're describing here?  The kind of general
> framework I would imagine would be nice to have is something like this:
> users and items themselves live as (semi-structured) documents (e.g. like
> a Lucene Document, or more generally a Map<String, Map<String, Float>>,
> where the first key is the "field name", and the values are bag-of-words
> term-vectors or phrase vectors).

In particular I'm referring to the ItemSimilarity interface. You stick
that into an item-based recommender (which is really what Ted has been
describing). So to do content-based recommendation, you just implement
a notion of similarity based on content and plug it in that way.

Same with UserSimilarity and user-based recommenders.
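
For instance, a content-based ItemSimilarity could be a minimal sketch
like the following. The contentVectors map (item ID to term-weight
vector) is a hypothetical stand-in for however you extract content
features; it just computes plain cosine over them:

import java.util.Collection;
import java.util.Map;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/**
 * Sketch: content-based similarity plugged in through the
 * ItemSimilarity hook. contentVectors (item ID -> sparse
 * term-weight vector) is a hypothetical content-feature source.
 */
public class ContentItemSimilarity implements ItemSimilarity {

  private final Map<Long, Map<String, Double>> contentVectors;

  public ContentItemSimilarity(Map<Long, Map<String, Double>> contentVectors) {
    this.contentVectors = contentVectors;
  }

  public double itemSimilarity(long itemID1, long itemID2) {
    Map<String, Double> v1 = contentVectors.get(itemID1);
    Map<String, Double> v2 = contentVectors.get(itemID2);
    if (v1 == null || v2 == null) {
      return Double.NaN; // unknown item: no similarity computable
    }
    // Plain cosine over the two sparse vectors
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    for (Map.Entry<String, Double> e : v1.entrySet()) {
      double w1 = e.getValue();
      norm1 += w1 * w1;
      Double w2 = v2.get(e.getKey());
      if (w2 != null) {
        dot += w1 * w2;
      }
    }
    for (double w2 : v2.values()) {
      norm2 += w2 * w2;
    }
    return dot == 0.0 ? 0.0 : dot / Math.sqrt(norm1 * norm2);
  }

  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // Content vectors are static in this sketch; nothing to refresh
  }
}

Hand that to a GenericItemBasedRecommender along with your DataModel
and the rest of the recommender machinery works unchanged.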

I imagine this problem can be reduced to a search problem. Maybe vice
versa. I suppose my take on it -- and the reality of it -- is that
what's there is highly specialized for CF. I think that's a good
thing, since the API is more natural and I imagine it's a lot
faster. On my laptop I can do recommendations in about 10ms over 10M
ratings.


> Now the set of users by themselves, instead of just being labels on the
> rows of the preference matrix, is a users-by-terms matrix, and the items,
> instead of being just labels on the columns of the preference matrix, are
> also an items-by-terms matrix.

Yes, this is a fundamentally offline approach, right? What exists now
is entirely online. A change in data is reflected immediately. That's
interesting and simple and powerful, but doesn't really scale -- my
rule of thumb is that past 100M data points the non-distributed code
isn't going to work. Below that size -- and that actually describes
most real-world data sets -- it works fine.

The way forward is indeed to write exactly what you and Ted are
talking about: something distributable. And yeah it's going to be a
matrix-based sort of approach. I've started that.
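
To make the matrix view concrete: if A is the user-item preference
matrix, the recommendation vector for a user with preference vector u
is (A' A) u -- item-item co-occurrence times the user's own
preferences. Here's a toy in-memory version, just to show the shape of
the computation (weighting and normalization left out; the distributed
job would parallelize these same two multiplications):

/**
 * Toy illustration of the matrix formulation: with A the user-item
 * preference matrix, a user's recommendation vector is (A' A) u.
 * Dense arrays keep the shape of the computation visible.
 */
public class CooccurrenceSketch {

  static double[] recommend(double[][] a, double[] userPrefs) {
    int numItems = a[0].length;
    // A' A: item-item co-occurrence (unweighted, unnormalized)
    double[][] cooc = new double[numItems][numItems];
    for (double[] row : a) { // one row per user
      for (int i = 0; i < numItems; i++) {
        if (row[i] != 0.0) {
          for (int j = 0; j < numItems; j++) {
            cooc[i][j] += row[i] * row[j];
          }
        }
      }
    }
    // (A' A) u: weight each item by its co-occurrence with what
    // the user already prefers
    double[] rec = new double[numItems];
    for (int i = 0; i < numItems; i++) {
      for (int j = 0; j < numItems; j++) {
        rec[i] += cooc[i][j] * userPrefs[j];
      }
    }
    return rec;
  }
}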

What exists now is more a literal translation of the canonical CF
algorithms, which aren't really rocket science. That makes it more
accessible than the matrix-based, Hadoop-based approaches. But we need
those now too.


> The real magic in here is in this last piece, and in an implied piece
> in generating the content-based matrix: different semi-structured
> fields for both the items and users can be pegged against each
> other in different ways, with different weights - let's be concrete,
> and imagine the item-to-item content-based calculation:

It'll be a challenge to integrate content-based approaches to a larger
degree than they already are: what can you really do but offer a hook
to plug in some notion of similarity?

But yes, I think we want to re-use the UserSimilarity/ItemSimilarity
hooks even in matrix-based approaches, for consistency. So...
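
Concretely, that hook is already enough to express a weighted blend of
collaborative and content-based similarity -- a crude version of
pegging fields against each other with different weights. All the
names here are made up for illustration:

import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/**
 * Hypothetical sketch: blend a collaborative similarity with a
 * content-based one behind the single ItemSimilarity hook.
 */
public class BlendedItemSimilarity implements ItemSimilarity {

  private final ItemSimilarity collaborative;
  private final ItemSimilarity content;
  private final double contentWeight; // in [0,1]

  public BlendedItemSimilarity(ItemSimilarity collaborative,
                               ItemSimilarity content,
                               double contentWeight) {
    this.collaborative = collaborative;
    this.content = content;
    this.contentWeight = contentWeight;
  }

  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    double cf = collaborative.itemSimilarity(itemID1, itemID2);
    double cb = content.itemSimilarity(itemID1, itemID2);
    if (Double.isNaN(cf)) {
      return cb; // no rating overlap: fall back on content alone
    }
    if (Double.isNaN(cb)) {
      return cf; // no content available: fall back on ratings
    }
    return (1.0 - contentWeight) * cf + contentWeight * cb;
  }

  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    collaborative.refresh(alreadyRefreshed);
    content.refresh(alreadyRefreshed);
  }
}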


> Calculating the text-based similarity of *unstructured* documents is
> one thing, and resolves just to figuring out whether you're doing
> BM25, Lucene scoring, pure cosine - just a Similarity decision.

Exactly, and this is already implemented in some form as
PearsonCorrelationSimilarity, for example. So the same kinds of ideas
are in the existing non-distributed code; it just looks different.
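
In Taste terms the "just a Similarity decision" point looks like this:
swapping in a different metric is a one-line change. The file name and
IDs below are placeholders:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarityChoiceExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // The "Similarity decision": Pearson here, cosine or
    // anything else by swapping this one line
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    Recommender recommender = new GenericItemBasedRecommender(model, similarity);
    List<RecommendedItem> top = recommender.recommend(123L, 5); // top 5 for user 123
    for (RecommendedItem item : top) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}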


Basically you are clearly interested in
org.apache.mahout.cf.taste.hadoop, and probably don't need to care
about the rest unless you want to. That's good, because the new bits
are the bits that aren't written yet and that I don't know a lot about.

For example, look at .item: it implements Ted's ideas. It's not quite
complete -- I'm not normalizing the recommendation vector yet, for
example. So maybe that's a good place to dive in.
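
For what it's worth, the missing normalization might be as simple as
scaling the recommendation vector to unit length so scores are
comparable across users -- a guess at one reasonable choice, not
necessarily what .item will end up doing:

/**
 * Hypothetical: scale a recommendation vector (as computed in the
 * earlier co-occurrence sketch) to unit length.
 */
static void normalize(double[] rec) {
  double norm = 0.0;
  for (double v : rec) {
    norm += v * v;
  }
  norm = Math.sqrt(norm);
  if (norm > 0.0) {
    for (int i = 0; i < rec.length; i++) {
      rec[i] /= norm;
    }
  }
}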


... we might even consider naming the distributed CF stuff something
else, since it's actually a totally different implementation than
"cf.taste"
