Ted / Karl,

Thank you both for your comments and suggestions. Continuing on the comments from Ted...
The end goal is definitely not clustering but rather recommendations. This can be broken down into two separate tasks typical of a recommendation engine:

1. Given a URL, show other URLs people have liked.
2. Given a user session and the URL he is viewing, suggest other URLs he might like.

I experimented a bit with clustering but couldn't get good recommendations. From your advice, the log-likelihood ratio sounds like a potential solution for the first one. I remember having a discussion with you and Sean a long time back where you pointed to a useful paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962

Please pardon me if I am asking the question again, but do you think it's a promising approach for problem 1? Do we have an implementation for this in Mahout? If not, I can open it and work on it (given my other work commitments allow enough time). Also, since I have no formal statistics background, I am working on 'rebuilding' my statistics knowledge so that I can grasp these concepts better.

As for the data rates, I really don't know what's low and what's high in the context of these techniques, but what I have learnt after accumulating weeks of data is that there are a few users who have good engagement (sufficient clicks) over a period of time, a moderate number of users who have a small number of clicks, and a large number of users who have very few clicks and are just casual surfers.

Also, regarding building a user model as a simple mixture, I am not sure which one you are referring to. Is it the LDA JIRA that Jeff is working on?

Once again, thanks for all the help, much appreciated.

Regards
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Thursday, January 15, 2009 12:58 AM
To: [email protected]
Subject: Re: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

That kind of clustering is not what would answer the important question for me.
It sounds like the important question is one of two things:

a) What drives the viewing of the page (using external evidence only and not estimating any kind of user model)?
b) Can we estimate some sort of mental state that leads to viewing (using evidence to drive a user model, which in turn drives new evidence)?

The first approach has the virtue of great simplicity. To predict a view, you can find anomalously cooccurrent URLs (cooccurrent in the sense that they were viewed by the same user). Then you can use the anomalously cooccurrent URLs to build a viewing model, probably using something like ridge logistic regression or simply using IDF weighting. I don't think that Naive Bayes is as good for these models as people think. Anomalous cooccurrence is not something well detected by Jaccard or its cousins, but it is well detected by log-likelihood ratio tests based on 2x2 contingency tables. Simply using anomalously cooccurrent URLs makes for a very good recommendation engine if you have good data rates.

The second approach has the virtue that you can define a nice metric for URLs. First you take the likelihood that two URLs will both be visited by a particular user, then you average over the distribution of all users. This gives you an overall cooccurrence probability for two URLs, the negative log of which is likely to be informative if interpreted as a distance. Building the user model gives you the virtue of smoothing, which is where Jaccard falls apart. For low data rates, this is likely to give you the best results.

Neither of these approaches involves clustering, because clustering is not the goal. If you really want the clustering as an end in itself, I think you should restate the coordinate space before clustering.
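For reference, the 2x2 log-likelihood ratio test mentioned above can be sketched roughly as follows (a minimal illustration in Python, not Mahout's API; the count names and session-based framing are my own assumptions):

```python
import math

def xlogx(x):
    # x * log(x), with the convention 0 * log(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy term used in the G^2 statistic
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed interpretation for the URL-cooccurrence case:
      k11: users who viewed both URL A and URL B
      k12: users who viewed A but not B
      k21: users who viewed B but not A
      k22: users who viewed neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Strongly associated URLs score high; independent ones score near zero.
print(llr(10, 0, 0, 10))  # large: anomalous cooccurrence
print(llr(5, 5, 5, 5))    # ~0: no association
```

High-scoring pairs would then be the candidate "anomalously cooccurrent" URLs fed into the recommendation step.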
A good way to do that is to build a user model that is a simple mixture; then the mixture coefficients for each document make good coordinates for normal clustering algorithms (because dot products have such a natural interpretation, which implies that Euclidean distance is a useful measure). In fact, the dot product for the mixture model is exactly the "average over all users" idea from above for the special case of a mixture model. Other user models might not be so simple. Jeff Eastman has done a lot of work on building such a mixture model estimator for Mahout.

I have had cases where clustering and model building were essentially unable to interpret the data at all in the original data space, but were able to extract enormous amounts of useful information when the data was restated in terms of mixture model coefficients.

On Tue, Jan 13, 2009 at 4:33 AM, Ankur (JIRA) <[email protected]> wrote:

> I would like to cluster them in a tree and use the model to answer the near
> neighborhood type queries, i.e. what urls are related to what other urls. I
> did implement a sequential bottom-up hierarchical clustering algorithm but
> the complexity is too bad for my data-set. I then thought about implementing
> a top-down hierarchical clustering algorithm using Jaccard co-efficient as
> my distance measure and came across this patch.
>
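As a footnote, the mixture-coefficient dot product Ted describes above can be sketched like this (a toy Python illustration; the vectors and numbers are made up, and each URL is assumed to already have a vector of mixture coefficients from some estimator):

```python
import math

def cooccurrence_distance(p_a, p_b):
    # p_a[k], p_b[k]: probability that URL a (resp. b) is associated with
    # mixture component k. The dot product stands in for the probability
    # that both URLs are visited, averaged over users; its negative log
    # is interpreted as a distance.
    dot = sum(a * b for a, b in zip(p_a, p_b))
    return -math.log(dot)

# URLs concentrated on the same component are close ...
print(cooccurrence_distance([0.9, 0.1], [0.9, 0.1]))  # small
# ... and URLs concentrated on different components are far apart.
print(cooccurrence_distance([0.9, 0.1], [0.1, 0.9]))  # larger
```

These distances (or the coefficient vectors themselves as coordinates) could then be handed to an ordinary clustering algorithm, per the restated-coordinate-space suggestion above.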
