Thanks - I'll look into Latent Dirichlet Allocation and your recommendation. I'm also starting to look into clustering, since I believe it supports at least one of the three interest-driven recommendation types I plan to implement.

My first type is frequent item sets based on user interest. This is built on item-to-item co-occurrences and results in items being recommended for other items (just a note here that this is end-user facing, not the association-rules type of recommendations meant to be consumed internally by analysts). For this one, based on the feedback I've received so far, I'll probably use org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.
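
Roughly what I have in mind for driving that job - untested, and the flag names are taken from the current trunk so they may differ in an older checkout; the paths are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ToolRunner;
  import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

  public class ItemToItemDriver {
    public static void main(String[] args) throws Exception {
      String[] jobArgs = {
          "--input", "/prefs",                 // userID,itemID[,pref] lines on HDFS (placeholder path)
          "--output", "/recs",                 // placeholder path
          "--booleanData", "true",             // treat the mere presence of an interaction as the preference
          "--numRecommendations", "10"
      };
      ToolRunner.run(new Configuration(), new RecommenderJob(), jobArgs);
    }
  }
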
The second type is user-level recommendation - CF-driven, I think, but I still need to do my homework here. This results in a top-N recommendation list for each user, I believe.
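
If I end up trying this one in memory first, I'm picturing something along the lines of the Taste boolean user-based recommender - the neighborhood size, file name and user ID below are just placeholders:

  import java.io.File;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class TopNPerUser {
    public static void main(String[] args) throws Exception {
      // "userID,itemID" lines with no preference value, i.e. boolean interest data
      DataModel model = new FileDataModel(new File("prefs.csv"));
      UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
      GenericBooleanPrefUserBasedRecommender recommender =
          new GenericBooleanPrefUserBasedRecommender(
              model, new NearestNUserNeighborhood(50, similarity, model), similarity);
      List<RecommendedItem> topN = recommender.recommend(12345L, 10); // top-10 for one user
      System.out.println(topN);
    }
  }
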
The third one is a mix of the first type and clustering. I cluster all the users based on their behavior (either browsing or purchasing) and possibly some static attributes (gender, age group). I expect to get a number of clusters large enough to segment the user base into representative clusters (I have no idea at this point how many, but I assume a few hundred). Then for each cluster I compute the item-to-item co-occurrences separately. This results in items being recommended for other items, but the result is different for each user based on his cluster membership. Does this make sense? Is there an easier/better way to approach this?
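
To make the routing concrete, here's an in-memory sketch of just that part - clusterOf() is a hypothetical lookup produced by whatever clustering step I end up with, and at my scale the per-cluster co-occurrences would really come from a separate RecommenderJob run on each cluster's slice of the data rather than from memory:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
  import org.apache.mahout.cf.taste.impl.common.FastIDSet;
  import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
  import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;

  public class PerClusterRecommenders {

    // hypothetical: cluster membership looked up from the clustering output
    static int clusterOf(long userID) {
      return 0;
    }

    public static void main(String[] args) throws Exception {
      // one boolean preference map per cluster: userID -> set of itemIDs
      Map<Integer, FastByIDMap<FastIDSet>> prefsByCluster =
          new HashMap<Integer, FastByIDMap<FastIDSet>>();
      // ... fill prefsByCluster from the behavior data ...

      // build one item-based boolean recommender per cluster
      Map<Integer, Recommender> recommenderByCluster = new HashMap<Integer, Recommender>();
      for (Map.Entry<Integer, FastByIDMap<FastIDSet>> e : prefsByCluster.entrySet()) {
        DataModel clusterModel = new GenericBooleanPrefDataModel(e.getValue());
        recommenderByCluster.put(e.getKey(),
            new GenericBooleanPrefItemBasedRecommender(
                clusterModel, new LogLikelihoodSimilarity(clusterModel)));
      }

      // at request time: route the user to the recommender built for his cluster
      long userID = 12345L;
      Recommender recommender = recommenderByCluster.get(clusterOf(userID));
      if (recommender != null) {
        List<RecommendedItem> items = recommender.recommend(userID, 10);
        System.out.println(items);
      }
    }
  }
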
-qf
--- On Wed, 5/5/10, Ankur C. Goel <gan...@yahoo-inc.com> wrote:

From: Ankur C. Goel <gan...@yahoo-inc.com>
Subject: Re: Algorithm scalability
To: "mahout-user@lucene.apache.org" <mahout-user@lucene.apache.org>
Received: Wednesday, May 5, 2010, 7:43 AM

Since you are focused on users' interests signaled by the presence/absence of items, one of the approaches would be to cluster the users using a probabilistic clustering algorithm like LDA - https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html. You could also use Minhash - https://issues.apache.org/jira/browse/MAHOUT-344 - but the patch needs more work.

Use the cluster information to calculate a weighted score for a candidate item and normalize the score across all clusters. Top-N items sorted by their score should make for good recommendations.
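
Schematically (illustrative only, not Mahout API - this assumes the clustering gives you p(cluster | user) for every user and a per-cluster weight for every item, e.g. a co-occurrence count normalized within the cluster so that scores are comparable across clusters):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public final class ClusterScoring {

    // Scores each candidate as sum over clusters of p(cluster|user) * weight(item|cluster),
    // then returns the top-N item IDs by descending score.
    public static List<Long> topN(Map<Integer, Double> clusterProbsForUser,
                                  Map<Integer, Map<Long, Double>> itemWeightByCluster,
                                  Iterable<Long> candidates,
                                  int n) {
      final Map<Long, Double> score = new HashMap<Long, Double>();
      for (Long item : candidates) {
        double s = 0.0;
        for (Map.Entry<Integer, Double> c : clusterProbsForUser.entrySet()) {
          Map<Long, Double> weights = itemWeightByCluster.get(c.getKey());
          Double w = (weights == null) ? null : weights.get(item);
          if (w != null) {
            s += c.getValue() * w; // weighted by the user's affinity to that cluster
          }
        }
        score.put(item, s);
      }
      List<Long> ranked = new ArrayList<Long>(score.keySet());
      Collections.sort(ranked, new Comparator<Long>() {
        public int compare(Long a, Long b) {
          return Double.compare(score.get(b), score.get(a)); // descending by score
        }
      });
      return ranked.subList(0, Math.min(n, ranked.size()));
    }
  }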

You will also need to do some pruning before clustering, throwing away users with very low item counts and down-sampling users with excessively high item counts.
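
For example (schematic, plain Java collections, and the thresholds are made up for illustration):

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Random;
  import java.util.Set;

  public final class PruneUsers {

    // prefs: userID -> set of itemIDs; MIN_ITEMS / MAX_ITEMS are arbitrary example thresholds
    public static Map<Long, Set<Long>> prune(Map<Long, Set<Long>> prefs) {
      final int MIN_ITEMS = 3;
      final int MAX_ITEMS = 1000;
      Random random = new Random();
      Map<Long, Set<Long>> pruned = new HashMap<Long, Set<Long>>();
      for (Map.Entry<Long, Set<Long>> e : prefs.entrySet()) {
        Set<Long> items = e.getValue();
        if (items.size() < MIN_ITEMS) {
          continue; // too little signal to cluster on
        }
        if (items.size() > MAX_ITEMS) {
          // down-sample heavy users: keep each item with probability MAX_ITEMS / size
          double keep = (double) MAX_ITEMS / items.size();
          Set<Long> sampled = new HashSet<Long>();
          for (Long itemID : items) {
            if (random.nextDouble() < keep) {
              sampled.add(itemID);
            }
          }
          items = sampled;
        }
        pruned.put(e.getKey(), items);
      }
      return pruned;
    }
  }
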
-...@nkur

On 5/5/10 4:29 PM, "First Qaxy" <qa...@yahoo.ca> wrote:

Thanks!
Slope-one is on the map for when I start to look into recommending based on user satisfaction (ratings). At this point I'm focusing on user interest, which limits me to boolean-based algorithms.
-qf

--- On Tue, 5/4/10, Sean Owen <sro...@gmail.com> wrote:

From: Sean Owen <sro...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Tuesday, May 4, 2010, 5:01 PM

On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> Purely based on estimates, assuming 5 billion transactions, 5 million users and
> 100K products, normally distributed, I expect a sparse item-to-item matrix of up
> to 10 million significant co-occurrences (significance is not globally defined
> but in the context of the active item to recommend from; in other words support
> can be really tiny, confidence less so).

Sounds like a pretty solid size of data set. I think the recommender will work
fine on this -- well, I suppose it depends on your expectations, but this whole
piece has been completely revised recently and I feel that it's tuned nicely now.


> A few questions:
> - In 0.3 there was also an org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender
>   that I cannot find in the latest trunk. Was this merged into RecommenderJob?
> - Is there any example or unit test for the hadoop.item.RecommenderJob?
> - Is there any more documentation

This has been merged into org.apache.mahout.cf.taste.hadoop.item.* as
part of this complete overhaul.

>   on hadoop.pseudo? I am still not clear how that is broken into chunks in the
>   case of larger models and how the results are merged afterwards.
> - For clustering: if I want to create a few hundred user clusters, is that doable
>   on a model similar to the one described above, based on boolean preferences?

For this scale, I don't think you can use the pseudo-distributed
recommender. It's just too much data to get into an individual machine's
memory. In this case nothing is broken down, since non-distributed
algorithms generally use all the data. It's just that one non-distributed
recommender is cloned N times so you can crank out recommendations N
times faster very easily.

... well since you don't have all that many items, I could imagine one
algorithm working: slope-one. You would need to use Hadoop to compute
the item-item diffs ahead of time, and prune it. But a pruned set of
item-item diffs fits in memory. You could go this way.
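
Roughly, the in-memory side could then look like this - class names and constructor
details are from memory, so double-check them against your version; the file paths
are placeholders:

  import java.io.File;
  import java.util.List;
  import org.apache.mahout.cf.taste.common.Weighting;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
  import org.apache.mahout.cf.taste.impl.recommender.slopeone.file.FileDiffStorage;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;

  public class SlopeOneFromPrecomputedDiffs {
    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("ratings.csv")); // userID,itemID,rating
      // diffs.txt: pruned item-item diffs precomputed on Hadoop (placeholder path);
      // the long argument caps how many diff entries are kept in memory
      FileDiffStorage diffStorage = new FileDiffStorage(new File("diffs.txt"), 10000000L);
      SlopeOneRecommender recommender =
          new SlopeOneRecommender(model, Weighting.WEIGHTED, Weighting.UNWEIGHTED, diffStorage);
      List<RecommendedItem> topN = recommender.recommend(12345L, 10);
      System.out.println(topN);
    }
  }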

But I think this is the sort of situation very well suited to the
properly distributed implementation.
