On Mon, May 31, 2010 at 4:01 PM, Sean Owen <[email protected]> wrote:
>
>
> So I don't know that SQL should go away, if only because, for every
> business with Cassandra set up there are 1000 with a database. And for
> every user with huge data, there are 100 with medium-sized data. I
> don't want to lose sight of supporting the common case on the way.
>
> One larger point I'm making, which I don't believe anyone's disputing,
> is that there are more than just huge item-based recommenders in play
> here. What's below makes total sense, for one algorithm, though.
>

I'm certainly not recommending we _drop_ SQL!  Just add support for
NoSQL.  Maybe there's a problem with that moniker? :)


> Yes this is more or less how the item-based recommender works now,
> when paired up with a JDBC-backed ItemSimilarity. (It could use a
> "bulk query" method to get all those similarities in one go though;
> not too hard to weave in.)
>
> I'd be a little concerned about whether this fits comfortably in
> memory. The similarity matrix is potentially dense -- big rows -- and
> you're loading one row per item the user has rated. It could get into
> tens of megabytes for one query. The distributed version dares not do
> this. But, worth a try in principle.
>

Nonono, we'd definitely have to make an approximation here: trim down
the ItemSimilarity matrix to be more sparse, by removing similarities in
a row that are below a certain cutoff (as long as there are enough
entries in that row: if that row only has e.g. 10 nonzero similarities,
keep them all).  That way any call to the data store to get the
similarities for a given set of items (just the closest items with their
similarities, forgetting about the ones farther away) returns only,
like Ted suggested, 100 or 1000 * numItemsThisUserRated entries.
Should be retrieving well under 100KB at a time, hopefully.
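
To make the trimming concrete, here's a rough sketch in Java.  None of
this is Mahout API: SimilarityRowTrimmer, trim(), and the cutoff /
minKeep / maxKeep parameters are names made up for illustration, and the
per-row cap (maxKeep) is an assumption standing in for the "100 or 1000"
figure.  It also assumes minKeep <= maxKeep.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): sparsifies one row of an
// item-item similarity matrix before it is written to the data store.
final class SimilarityRowTrimmer {

  private SimilarityRowTrimmer() {}

  /**
   * @param row     itemID -> similarity for a single item
   * @param cutoff  similarities below this are dropped, once minKeep is reached
   * @param minKeep rows with at most this many entries are kept whole
   * @param maxKeep hard cap on entries kept per row (assumed >= minKeep)
   * @return the kept entries, ordered from most to least similar
   */
  static Map<Long, Double> trim(Map<Long, Double> row,
                                double cutoff, int minKeep, int maxKeep) {
    // Sort entries by descending similarity so the closest items come first.
    List<Map.Entry<Long, Double>> sorted = new ArrayList<>(row.entrySet());
    sorted.sort(Map.Entry.<Long, Double>comparingByValue().reversed());

    Map<Long, Double> trimmed = new LinkedHashMap<>();
    for (Map.Entry<Long, Double> e : sorted) {
      if (trimmed.size() >= maxKeep) {
        break; // never store more than the cap
      }
      if (trimmed.size() >= minKeep && e.getValue() < cutoff) {
        break; // everything after this point is below the cutoff too
      }
      trimmed.put(e.getKey(), e.getValue());
    }
    return trimmed;
  }
}
```

With cutoff 0.3 and minKeep 1, a row with similarities 0.9, 0.5, 0.1
and 0.05 trims down to the 0.9 and 0.5 entries; with minKeep 10 the same
row is kept whole.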


> > Seems like it would be a nice integration to try out.  The advantage of
> > that kind of recommender (on the fly) is that you could change the way
> > you compute recommendations (i.e. the model) on a per-query basis
> > if need be, and if new items were added to the user's list of things
> > they'd rated, that user's row in the key-value store could be updated
> > on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> > but it could be batch updated periodically).
>
> +1 for someone to try it. I'm currently more interested in the Hadoop
> end of the spectrum myself.
>

Yep, I hear ya.  I'm just trying to chime in with a nice "+1" to encouraging
any patches which allow for this kind of thing.  Someday I'll have the
time to look at these kinds of fun additions myself, but it isn't quite now.

  -jake
