On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <[email protected]> wrote:

> I have a little difficulty in understanding LLR for cluster labels.
>

Sorry about that.  I will try to be more clear.


>  For a phrase, if
> - in-cluster doc frequency is  inDF
> - out-of-cluster doc frequency is  outDF
> - size of the cluster is clusterSize
> - size of the corpus is corpusSize
>

Good.


> how do I calculate the LLR?
>

Assuming that the corpus is a superset of the cluster, form the table using:

     k11 = inDF
     k12 = clusterSize - inDF
     k21 = outDF
     k22 = corpusSize - clusterSize - outDF

If the cluster is not a subset of the corpus (so that corpusSize counts only
out-of-cluster documents), then k22 = corpusSize - outDF
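
If it helps, here is a rough standalone Java sketch of the whole computation.
The class and method names are just made up for illustration; the LLR itself is
the usual G^2 statistic computed from the entropies of the cells, row sums, and
column sums, which is, as far as I know, essentially how the Mahout code
computes it as well.

public class ClusterLabelLlr {

  // x * ln(x), with the convention that 0 * ln(0) = 0 so empty cells are handled
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // un-normalized entropy (N * H, in nats) of a set of counts
  private static double entropy(long... counts) {
    long sum = 0;
    double xlx = 0.0;
    for (long count : counts) {
      xlx += xLogX(count);
      sum += count;
    }
    return xLogX(sum) - xlx;
  }

  // raw log-likelihood ratio for a 2x2 contingency table
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // guard against tiny negative results from floating-point rounding
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
  }

  // form the table exactly as above (assumes the cluster is a subset of the
  // corpus; otherwise use k22 = corpusSize - outDF as noted)
  public static double clusterLabelScore(long inDF, long outDF,
                                         long clusterSize, long corpusSize) {
    long k11 = inDF;
    long k12 = clusterSize - inDF;
    long k21 = outDF;
    long k22 = corpusSize - clusterSize - outDF;
    return logLikelihoodRatio(k11, k12, k21, k22);
  }
}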


>  I have difficulty in mapping these numbers to Event A & Event B that
> you talked on your blog.
>

Event A is in-cluster, Event B is out-of-cluster.


>  From the basic numbers, I could come up with an inCluster percentage. But
> that doesn't help much. For example, let's say my cluster size is
> 2000 documents and corpus size is 1000.  A phrase which occurs in 5
> documents in the cluster and doesn't appear outside the cluster has an
> inCluster percentage of 100. Another phrase occurs in 1000 documents in
> the cluster and 1000 documents outside the cluster; this phrase has an
> inCluster percentage of 50. Intuitively, this is a better candidate for a
> label than the previous one. But I am unable to figure out how these
> numbers need to be normalized.
>

First, the corpus size should normally be much larger than your cluster
size.  With document categorization the ratio is enormous; with clustering
it should still be at least one order of magnitude larger.

So let's take your example and add a case where in-cluster = 5 and
out-cluster = 5, another where in-cluster = 5 and out-cluster = 100, and
another where in-cluster = 1000 and out-cluster = 45,000.

Also suppose that the corpus has 100,000 documents in it.  Then we have
(k11, k12, k21, k22, LLR) as

5, 1995, 0, 98000, 39.33
5, 1995, 5, 97995, 25.47
5, 1995, 100, 97900, 2.96
1000, 1000, 1000, 97000, 5714.93
1000, 1000, 45000, 48000, 2.04
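
As a quick check, the rows of that table can be pushed through the
logLikelihoodRatio method from the sketch earlier in this message (the driver
class below is purely hypothetical).  The values it prints agree closely with
the LLR column above; the small differences in a couple of rows are presumably
just rounding or a slightly different variant of the formula.

public class LlrTableCheck {
  public static void main(String[] args) {
    // the (k11, k12, k21, k22) rows from the table above
    long[][] rows = {
        {5, 1995, 0, 98000},
        {5, 1995, 5, 97995},
        {5, 1995, 100, 97900},
        {1000, 1000, 1000, 97000},
        {1000, 1000, 45000, 48000}
    };
    for (long[] r : rows) {
      double llr = ClusterLabelLlr.logLikelihoodRatio(r[0], r[1], r[2], r[3]);
      System.out.printf("%d, %d, %d, %d -> %.2f%n", r[0], r[1], r[2], r[3], llr);
    }
  }
}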

According to LLR, your original case of 5 in and 0 out is definitely worthy
of mention and the case with 5 in and 5 out is somewhat less interesting.
The case with 5 in and 100 out is not very interesting, nor is the case with
1000 in and 45000 out.  Your case with 1000 in and 1000 out is the most
exciting of them all.

The most useful way to think of this is as the percentage of in-cluster
documents that have the feature (term) versus the percentage of out-of-cluster
documents that do, keeping in mind that both percentages are uncertain since
we have only a sample of all possible documents.  When these percentages are
very different and the difference is unlikely to be due to accidental
variation, LLR will be large.


I don't know if I mentioned this on the blog, but it is often nice to
rescale these scores by taking the square root and adding a sign according
to whether k11/(k11+k12) > k21/(k21+k22).  This gives you a number that has
the same scale as a normal distribution so lots more people will have good
intuitions about what is large and what is not.
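
In code, that rescaling is a one-liner on top of the earlier sketch; the method
below (name purely illustrative) could be added to the ClusterLabelLlr class:

// Signed square root of the raw LLR: positive when the feature is over-represented
// in the cluster (k11/(k11+k12) > k21/(k21+k22)), negative when under-represented.
// On this scale the score behaves roughly like a standard normal deviate.
public static double signedRootLlr(long k11, long k12, long k21, long k22) {
  double root = Math.sqrt(logLikelihoodRatio(k11, k12, k21, k22));
  double inRate = (double) k11 / (k11 + k12);
  double outRate = (double) k21 / (k21 + k22);
  return inRate > outRate ? root : -root;
}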
