On Sat, Dec 12, 2009 at 12:28 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> Ok, this kind of hook is good, but it leaves all of the work to the
> user - it would be nice to extend it along the lines I described,
> whereby developers can define how to pull out various features of
> their items (or users), and then give them a set of Similarities
> between those features, as well as interesting combining functions
> among those.

Maybe we mean different things. How can you build in, say, a notion of
content similarity for books, or travel destinations, or fruit,
without opening the door to a million domain-specific subprojects in
the library? I imagine you're saying that, at least, you could take
one step and write something that computes similarity in terms of
abstract features. Yeah, that's a great addition.
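
To make that concrete, here's the rough shape I have in mind - just a
sketch, where FeatureExtractor, FeatureSimilarity and the rest are
made-up names, not anything that exists in the code today:

import java.util.Map;

// All names here are invented. The user supplies the domain-specific
// extractor; the framework would ship common FeatureSimilarity
// implementations and the combining logic.

interface FeatureExtractor<T> {
  /** Pull named features out of a domain object (book, fruit, whatever). */
  Map<String,Object> extractFeatures(T item);
}

interface FeatureSimilarity {
  /** Similarity in [-1,1] between two values of one feature. */
  double similarity(Object value1, Object value2);
}

final class FeatureBasedItemSimilarity<T> {

  private final FeatureExtractor<T> extractor;
  private final Map<String,FeatureSimilarity> perFeature;

  FeatureBasedItemSimilarity(FeatureExtractor<T> extractor,
                             Map<String,FeatureSimilarity> perFeature) {
    this.extractor = extractor;
    this.perFeature = perFeature;
  }

  /** Dumbest possible combiner: average the per-feature similarities. */
  double itemSimilarity(T item1, T item2) {
    Map<String,Object> features1 = extractor.extractFeatures(item1);
    Map<String,Object> features2 = extractor.extractFeatures(item2);
    double total = 0.0;
    int count = 0;
    for (Map.Entry<String,FeatureSimilarity> entry : perFeature.entrySet()) {
      Object value1 = features1.get(entry.getKey());
      Object value2 = features2.get(entry.getKey());
      if (value1 != null && value2 != null) {
        total += entry.getValue().similarity(value1, value2);
        count++;
      }
    }
    return count == 0 ? Double.NaN : total / count;
  }
}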


> Yeah, this is viewing it as a search problem, and similarly, you can
> do search over 10-50M documents, often even under that latency, with
> Lucene, so there's no reason why the two could not be tied together
> nicely to provide a blend of content- and usage-based
> recommendations/searches.

You're saying you want to build out a new style of content-based
recommender? That's good indeed. I imagine it's not really a new
Recommender but a new ItemSimilarity/UserSimilarity framework, which
is good news; it just means it's simpler. If it leverages Lucene,
great, but is that a big dependency to bring in?


> Well, computing the user-item content-based similarity matrix *can*
> be done offline, and once you have it, it can be used to produce
> recommendations online, but another way to do it (and the way we do
> it at LinkedIn), is to keep the items in Voldemort, and store them
> "in transpose" in a Lucene index, and then compute similar items in
> real time as a Lucene query.  Doing item-based recommendations this
> way is just grabbing the sparse set of items a user prefers, OR'ing
> these together (with boosts which encode the preferences), and
> firing away a live search request.
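
If I'm following you, that step would look roughly like this in
Lucene terms - a sketch, where the "itemID" field name and the
parallel itemIDs/prefs arrays are my stand-ins, not your actual code:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

final class LiveItemBasedRecommender {

  /**
   * OR together the items a user prefers, boosted by preference
   * strength, and run it as one live query against the index holding
   * the items "in transpose". The top hits are the candidates.
   */
  static TopDocs recommend(IndexSearcher searcher,
                           long[] itemIDs,
                           float[] prefs,
                           int howMany) throws IOException {
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < itemIDs.length; i++) {
      TermQuery clause =
          new TermQuery(new Term("itemID", Long.toString(itemIDs[i])));
      clause.setBoost(prefs[i]);                      // preference encoded as boost
      query.add(clause, BooleanClause.Occur.SHOULD);  // OR, not AND
    }
    return searcher.search(query, howMany);
  }
}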

Right now in my mind there are two distinct breeds of recommender in
the framework: the existing online non-distributed bits, and the
forthcoming mostly-offline distributed bits. I'm trying to figure out
which of those buckets the direction you're describing falls into. It
could go into both, in different ways. Which do you have in mind?

While it would be nice to integrate this approach harmoniously into
the existing item-based recommender implementation, it's no big deal
to add a different style of item-based recommender. I'm just hoping
to avoid repetition where possible; the project is already becoming a
rich and varied, but intimidating, bag of tools for solving the same
problem.


> There are a ton of pluggable pieces: there's the hook for field-by-field
> similarity (and not just the hook, but a bunch of common
> implementations), sure, but then there's also a "feature processing /
> extracting" phase, which will be very domain specific, and then the
> scoring hook, where pairwise similarities among fields can be combined
> nontrivially (via logistic regression, via some nonlinear kernel function,
> etc...), as well as a separate system for people to actually *train* those
> scorers - that in itself is a huge component.

The feature processing bit feels out of scope to me, purely because
I can't see how you would write a general framework to extract
features from 'things' where 'things' are musical instruments,
parties, restaurants, etc. How would you? But everything past that
point is clearly in scope, doesn't exist yet, and should.
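
Just to check we're picturing the same combining/scoring hook:
something shaped like this? All names invented, and a plain weighted
logistic function stands in for whatever trained scorer would
actually get plugged in - the weights coming from that separate
training component you mention:

// Sketch of the combining hook: take the vector of per-field
// similarities and collapse it to one score. Logistic regression
// over the fields is one implementation; a nonlinear kernel would be
// another implementation of the same interface.

interface SimilarityCombiner {
  double combine(double[] fieldSimilarities);
}

final class LogisticCombiner implements SimilarityCombiner {

  private final double[] weights; // learned offline, one per field
  private final double bias;

  LogisticCombiner(double[] weights, double bias) {
    this.weights = weights;
    this.bias = bias;
  }

  @Override
  public double combine(double[] fieldSimilarities) {
    double z = bias;
    for (int i = 0; i < weights.length; i++) {
      z += weights[i] * fieldSimilarities[i];
    }
    return 1.0 / (1.0 + Math.exp(-z)); // squash to (0,1)
  }
}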
