Ted / Karl, Thank you both for your comments and suggestions. Continuing
on the comments from Ted...

The end goal is definitely not clustering but rather recommendations. This
can be broken down into 2 separate tasks typical of a recommendation
engine.
1. Given a URL, show other URLs people have liked.
2. Given a user session and the URL being viewed, suggest other URLs the
user might like.

I experimented a bit with clustering but couldn't get good
recommendations.
From your advice, the log-likelihood ratio sounds like a potential solution
for the first one. I remember having a discussion with you and Sean a long
time back where you pointed to a useful paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962

Please pardon me if I am asking the question again, but do you think it's
a promising approach for problem 1? Do we have an implementation for
this in Mahout? If not, I can open a JIRA issue and work on it (provided my
other work commitments allow enough time). Also, since I have no formal
statistics background, I am working on 'rebuilding' my statistics
knowledge so that I can grasp these concepts better.

As for the data rates, I really don't know what counts as low or high in
the context of these techniques. What I have learnt after accumulating
weeks of data is that a few users have good engagement (sufficient clicks)
over a period of time, a moderate number of users have a small number of
clicks, and a large number of users have very few clicks and are just
casual surfers.

Also regarding building a user model as a simple mixture, I am not sure
which one you are referring to. Is it the LDA JIRA that Jeff is working
on?

Once again thanks for all the help, much appreciated.

Regards
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Thursday, January 15, 2009 12:58 AM
To: [email protected]
Subject: Re: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

That kind of clustering is not what would answer the important question
for me.

It sounds like the important question is one of two things:

a) what drives the viewing of the page (using external evidence only and
not estimating any kind of user model)

b) can we estimate some sort of mental state that leads to viewing (using
evidence to drive a user model which drives new evidence)

The first approach has the virtue of great simplicity.  To predict a view,
you can find anomalously co-occurrent URLs (co-occurrent in the sense that
they were viewed by the same user).  Then you can use the anomalously
co-occurrent URLs to build a viewing model, probably using something like
ridge logistic regression or simply using IDF weighting.  I don't think
that Naive Bayes is as good for these models as people think.  Anomalous
co-occurrence is not something well detected by Jaccard or its cousins, but
it is well detected by log-likelihood ratio tests based on 2x2 contingency
tables.  Simply using anomalously co-occurrent URLs makes for a very good
recommendation engine if you have good data rates.
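A rough sketch of that 2x2 contingency-table test (my own implementation,
not Mahout's): for a pair of URLs, count the sessions that viewed both,
only one, or neither, and score the pair with the G^2 log-likelihood
ratio.  Large scores flag anomalous co-occurrence; independent URLs score
near zero.

```python
import math

def _n_entropy(counts):
    # N * Shannon entropy of a vector of counts (zero counts contribute 0).
    total = sum(counts)
    return sum(-k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.
    k11: users who viewed both URLs; k12: only URL A; k21: only URL B;
    k22: neither."""
    row = _n_entropy([k11 + k12, k21 + k22])
    col = _n_entropy([k11 + k21, k12 + k22])
    mat = _n_entropy([k11, k12, k21, k22])
    return max(0.0, 2.0 * (row + col - mat))
```

A perfectly independent table such as `llr(10, 10, 10, 10)` scores 0, while
a strongly associated one such as `llr(100, 1, 1, 100)` scores in the
hundreds.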

The second approach has the virtue that you can define a nice metric for
URLs.  First you take the likelihood that two URLs will both be visited by
a particular user, then you average over the distribution of all users.
This gives you an overall co-occurrence probability for two URLs, the
negative log of which is likely to be informative if interpreted as a
distance.  Building the user model gives you the virtue of smoothing, which
is where Jaccard falls apart.  For low data rates, this is likely to give
you the best results.
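A minimal sketch of that metric, assuming you already have fitted user
models giving each user's visit probability per URL (the function and
variable names here are hypothetical, not from any Mahout API):

```python
import math

def url_distance(url_a, url_b, user_models):
    """Negative log of the co-visit probability of two URLs, averaged over
    users.  Each entry of user_models maps URL -> that user's visit
    probability (visits assumed independent given the user)."""
    p = sum(m[url_a] * m[url_b] for m in user_models) / len(user_models)
    return -math.log(p)
```

The distance is symmetric, and URL pairs liked by the same users come out
closer than pairs liked by disjoint user groups.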

Neither of these approaches involves clustering, because clustering is not
the goal.  If you really want the clustering as an end in itself, I think
you should restate the coordinate space before clustering.  A good way to
do that is to build a user model that is a simple mixture; then the
mixture coefficients for each document make good coordinates for normal
clustering algorithms (because dot products have such a natural
interpretation, which implies that Euclidean distance is a useful measure).
In fact, the dot product for the mixture model is exactly the "average over
all users" idea from above for the special case of a mixture model.  Other
user models might not be so simple.  Jeff Eastman has done a lot of work on
building such a mixture model estimator for Mahout.
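A toy numeric check of that special case (my own notation, not from any
Mahout code): assume each user belongs to one of T mixture components,
where component t has weight pi[t] and per-URL view probability phi[t][d].
Giving each URL the coordinates x_d[t] = sqrt(pi[t]) * phi[t][d] makes the
dot product of two URL vectors equal the user-averaged co-visit
probability.

```python
import math

# Hypothetical two-component mixture over three URLs.
pi = [0.6, 0.4]
phi = [
    {"a": 0.9, "b": 0.8, "c": 0.1},   # component 0: likes a and b
    {"a": 0.1, "b": 0.2, "c": 0.9},   # component 1: likes c
]

def coords(d):
    # Mixture-based coordinates for URL d.
    return [math.sqrt(p) * comp[d] for p, comp in zip(pi, phi)]

def avg_covisit(a, b):
    # Co-visit probability averaged over the mixture of users.
    return sum(p * comp[a] * comp[b] for p, comp in zip(pi, phi))

def dot(x, y):
    return sum(u * v for u, v in zip(x, y))

# The two quantities agree, which is why Euclidean geometry on these
# coordinates is meaningful for standard clustering algorithms.
assert abs(dot(coords("a"), coords("b")) - avg_covisit("a", "b")) < 1e-12
```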

I have had cases where clustering and model building were essentially
unable to interpret the data at all in the original data space, but were
able to extract enormous amounts of useful information when the data was
restated in terms of mixture model coefficients.


On Tue, Jan 13, 2009 at 4:33 AM, Ankur (JIRA) <[email protected]> wrote:

> I would like to cluster them in a tree and use the model to answer the
> near neighborhood type queries, i.e. what urls are related to what other
> urls. I did implement a sequential bottom-up hierarchical clustering
> algorithm but the complexity is too bad for my data-set. I then thought
> about implementing a top-down hierarchical clustering algorithm using the
> Jaccard coefficient as my distance measure and came across this patch.
>
