Ted,

Thank you for the detailed explanation. I think I just hit the classic "this is left as an exercise to the reader."

One last query (I hope). On your blog you have defined LLR as follows:

> LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> where H is Shannon's entropy, computed as the sum of
> (k_ij / sum(k)) log (k_ij / sum(k))

I am unable to follow this. Do you have any code to explain it? An elaboration of the following example would be equally great.

> Also suppose that the corpus has 100,000 documents in it. Then we have
> (k11, k12, k21, k22, llr) as
>
> 5, 1995, 100, 97900, 2.96

Thanks,
--shashi

On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <[email protected]> wrote:
>
>> I have a little difficulty in understanding LLR for cluster labels.
>
> Sorry about that. I will try to be more clear.
>
>> For a phrase, if
>> - in-cluster doc frequency is inDF
>> - out-of-cluster doc frequency is outDF
>> - size of the cluster is clusterSize
>> - size of the corpus is corpusSize
>
> Good.
>
>> how do I calculate the LLR?
>
> Assuming that the corpus is a superset of the cluster, form the table using:
>
> k11 = inDF
> k12 = clusterSize - inDF
> k21 = outDF
> k22 = corpusSize - clusterSize - outDF
>
> If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF
>
>> I have difficulty in mapping these numbers to Event A & Event B that
>> you talked about on your blog.
>
> Event A is in-cluster, Event B is out-of-cluster.
>
>> From the basic numbers, I could come up with an inCluster percentage, but
>> that doesn't help much. For example, let's say my cluster size is
>> 2000 documents and the corpus size is 100,000. A phrase which occurs in the
>> cluster in 5 documents and doesn't appear outside the cluster has an
>> inCluster percentage of 100. Another phrase occurs 1000 times in
>> the cluster and 1000 times outside the cluster. This phrase has an
>> inCluster percentage of 50.
>> Intuitively, this is a better candidate for a label than the previous
>> one. But I am unable to figure out how these numbers need to be
>> normalized.
>
> First, the corpus size should normally be much larger than your cluster
> size. With document categorization, the ratio is enormous; with clustering
> it should still be at least one order of magnitude larger.
>
> So let's take your example and add a case where in-cluster = 5 and
> out-cluster = 5, another where in-cluster = 5 and out-cluster = 100, and
> another where in-cluster = 1000 and out-cluster = 45,000.
>
> Also suppose that the corpus has 100,000 documents in it. Then we have
> (k11, k12, k21, k22, llr) as
>
> 5, 1995, 0, 98000, 39.33
> 5, 1995, 5, 97995, 25.47
> 5, 1995, 100, 97900, 2.96
> 1000, 1000, 1000, 97000, 5714.93
> 1000, 1000, 45000, 48000, 2.04
>
> According to llr, your original case of 5 in and 0 out is definitely worthy
> of mention, and the case with 5 in and 5 out is somewhat less interesting.
> The case with 5 in and 100 out is not very interesting, nor is the case with
> 1000 in and 45000 out. Your case with 1000 in and 1000 out is the most
> exciting of them all.
>
> The most useful way to think of this is as the percentage of in-cluster
> documents that have the feature (term) versus the percentage out, keeping in
> mind that both percentages are uncertain since we have only a sample of all
> possible documents. Where these percentages are very different, and where
> that difference is unlikely to be due to accidental variation, LLR will
> be large.
>
> I don't know if I mentioned this on the blog, but it is often nice to
> rescale these scores by taking the square root and adding a sign according
> to whether k11/(k11+k12) > k21/(k21+k22). This gives you a number that has
> the same scale as a normal distribution, so lots more people will have good
> intuitions about what is large and what is not.
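Since Shashi asked for code, the blog formula can be turned into a short script directly. The sketch below is my own illustration in Python, not the Mahout implementation, and the function names are made up. One subtlety: H as written on the blog, sum of (k_ij / sum(k)) log (k_ij / sum(k)), is the negative of the usual Shannon entropy, but the signs cancel in the LLR formula as long as the same convention is used for the cell counts, row sums, and column sums (with the standard convention 0 log 0 = 0). It also includes the signed square-root rescaling Ted describes; the printed values should agree closely with the table in the thread, allowing for small rounding differences.

```python
import math

def h(counts):
    # Ted's H: sum of (k_ij / sum(k)) * log(k_ij / sum(k)), with 0 log 0 = 0.
    # Note this is the negative of the usual Shannon entropy; the sign
    # convention cancels out in llr() below as long as it is used consistently.
    total = sum(counts)
    return sum((k / total) * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    # LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
    k = [k11, k12, k21, k22]
    row_sums = [k11 + k12, k21 + k22]   # cluster size, rest of corpus
    col_sums = [k11 + k21, k12 + k22]   # docs with the term, docs without
    return 2 * sum(k) * (h(k) - h(row_sums) - h(col_sums))

def signed_root_llr(k11, k12, k21, k22):
    # Square root of LLR, signed by whether the in-cluster rate exceeds the
    # out-of-cluster rate, so the score sits on a normal-like scale.
    root = math.sqrt(llr(k11, k12, k21, k22))
    if k11 / (k11 + k12) < k21 / (k21 + k22):
        root = -root
    return root

# The five cases from the table above (corpus of 100,000 documents).
for row in [(5, 1995, 0, 98000),
            (5, 1995, 5, 97995),
            (5, 1995, 100, 97900),
            (1000, 1000, 1000, 97000),
            (1000, 1000, 45000, 48000)]:
    print(row, round(llr(*row), 2), round(signed_root_llr(*row), 2))
```

A quick sanity check on the formula: if the term rate is identical inside and outside the cluster (e.g. k = 10, 10, 10, 10), the LLR is exactly 0, and it grows as the in-cluster and out-of-cluster rates diverge.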
