Great! Solved the equations on paper and now I understand entropy and LLR.
Thanks, Ted.

--shashi

On Tue, Aug 11, 2009 at 12:11 AM, Ted Dunning <[email protected]> wrote:
> I have some R code handy. It should be easy to translate to Mahoutish Java.
>
> *entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}*
>
> The only tricky bit here is that sum will sum up a vector or a matrix, and (x==0) returns a matrix or vector of booleans that is the same shape as x. Adding a boolean means that all instances where x!=0 are left alone while all zeros are incremented by one. This causes the log to return zero, but we are (elementwise) multiplying by x anyway. Elements of x cannot be < 0 by definition (they are counts). The reason for all of that is so that 0 log 0 = 0, as desired. If sum(x) = 1, then this function will give you Shannon entropy.
>
> and then:
>
> *llr <- function(x) { 2*(entropy(rowSums(x)) + entropy(colSums(x)) - entropy(x))}*
>
> The functions rowSums and colSums do pretty much what you would expect. Given a matrix, they add up the elements in each row or column (to get a vector). I moved around some minus signs to make me happy, btw. All that matters in the end is that entropy and llr both give positive values.
>
> Some test cases:
>
> *> entropy(c(1,1))
> [1] 1.386294
>> llr(matrix(c(1,0,0,1), nrow=2))
> [1] 2.772589
>> llr(matrix(c(10,0,0,10), nrow=2))
> [1] 27.72589
>> llr(matrix(c(5,1995,0,100000), nrow=2))
> [1] 39.33052
>> llr(matrix(c(1000,1995,1000,100000), nrow=2))
> [1] 4730.737
>> llr(matrix(c(1000,1000,1000,100000), nrow=2))
> [1] 5734.343
>> llr(matrix(c(1000,1000,1000,99000), nrow=2))
> [1] 5714.932
> *
>
> On Mon, Aug 10, 2009 at 11:11 AM, Shashikant Kore <[email protected]> wrote:
>
>> Ted,
>>
>> Thank you for the elaborate explanation.
>>
>> I think I just hit the classic "this is left as an exercise to the reader." One last query (I hope).
>>
>> On your blog you have defined LLR as follows.
>>
>> > LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
>> > where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k))
>>
>> I am unable to follow this. Do you have any code to explain this? Or an elaboration of the following example would be equally great.
>>
>> > Also suppose that the corpus has 100,000 documents in it. We have (k11, k12, k21, k22, llr) as
>> >
>> > 5, 1995, 100, 97900, 2.96
>>
>> Thanks,
>>
>> --shashi
>>
>> On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning <[email protected]> wrote:
>> > On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <[email protected]> wrote:
>> >
>> >> I have a little difficulty in understanding LLR for cluster labels.
>> >
>> > Sorry about that. I will try to be more clear.
>> >
>> >> For a phrase, if
>> >> - in-cluster doc frequency is inDF
>> >> - out-of-cluster doc frequency is outDF
>> >> - size of the cluster is clusterSize
>> >> - size of the corpus is corpusSize
>> >
>> > Good.
>> >
>> >> how do I calculate the LLR?
>> >
>> > Assuming that the corpus is a superset of the cluster, form the table using:
>> >
>> > k11 = inDF
>> > k12 = clusterSize - inDF
>> > k21 = outDF
>> > k22 = corpusSize - clusterSize - outDF
>> >
>> > If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF
>> >
>> >> I have difficulty in mapping these numbers to Event A & Event B that
>> >> you talked about on your blog.
>> >
>> > Event A is in-cluster, Event B is out-of-cluster.
>> >
>> >> From the basic numbers, I could come up with an inCluster percentage. But
>> >> that doesn't help much. For example, let's say my cluster size is
>> >> 2000 documents and corpus size is 1000. A phrase which occurs in the
>> >> cluster in 5 documents and doesn't appear outside the cluster has an
>> >> inCluster percentage of 100. Another phrase occurs 1000 times in
>> >> the cluster and 1000 times outside the cluster. This phrase has an
>> >> inCluster percentage of 50. Intuitively, this is a better candidate for
>> >> a label than the previous one. But I am unable to figure out how these
>> >> numbers need to be normalized.
>> >
>> > First, the corpus size should normally be much larger than your cluster size. With document categorization, the ratio is enormous; with clustering it should still be at least one order of magnitude larger.
>> >
>> > So let's take your example and add a case where in-cluster = 5 and out-cluster = 5, another where in-cluster = 5 and out-cluster = 100, and another where in-cluster = 1000 and out-cluster = 45,000.
>> >
>> > Also suppose that the corpus has 100,000 documents in it. We have (k11, k12, k21, k22, llr) as
>> >
>> > 5, 1995, 0, 98000, 39.33
>> > 5, 1995, 5, 97995, 25.47
>> > 5, 1995, 100, 97900, 2.96
>> > 1000, 1000, 1000, 97000, 5714.93
>> > 1000, 1000, 45000, 48000, 2.04
>> >
>> > According to llr, your original case of 5 in and 0 out is definitely worthy of mention, and the case with 5 in and 5 out is somewhat less interesting. The case with 5 in and 100 out is not very interesting, nor is the case with 1000 in and 45000 out. Your case with 1000 in and 1000 out is the most exciting of them all.
>> >
>> > The most useful way to think of this is as the percentage of in-cluster documents that have the feature (term) versus the percentage out, keeping in mind that both percentages are uncertain since we have only a sample of all possible documents. Where these percentages are very different, and where that difference is unlikely to be due to accidental variation, LLR will be large.
>> >
>> > I don't know if I mentioned this on the blog, but it is often nice to rescale these scores by taking the square root and adding a sign according to whether k11/(k11+k12) > k21/(k21+k22). This gives you a number that has the same scale as a normal distribution, so lots more people will have good intuitions about what is large and what is not.
>
> --
> Ted Dunning, CTO
> DeepDyve
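To make the recipe concrete, here is a short, self-contained R sketch that plugs the inDF/outDF/clusterSize/corpusSize quantities from the thread into Ted's entropy and llr functions, building the 2x2 table exactly as he describes. The particular numbers (5 in-cluster, 100 out-of-cluster, cluster of 2,000, corpus of 100,000) reproduce the "5, 1995, 100, 97900, 2.96" row of his table; the rest is only an illustration, not code from Mahout.

entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}
llr     <- function(x) {2*(entropy(rowSums(x)) + entropy(colSums(x)) - entropy(x))}

# Counts for one candidate label phrase (the 2.96 row above).
inDF        <- 5       # in-cluster doc frequency
outDF       <- 100     # out-of-cluster doc frequency
clusterSize <- 2000    # documents in the cluster
corpusSize  <- 100000  # documents in the corpus (a superset of the cluster)

k <- matrix(c(inDF,  clusterSize - inDF,                  # k11, k12
              outDF, corpusSize - clusterSize - outDF),   # k21, k22
            nrow = 2, byrow = TRUE)

llr(k)  # about 2.96, matching the table above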

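The signed rescaling Ted describes in his last paragraph can be layered on top of the same functions. The helper below (signedRootLLR is just a name chosen for this sketch, not something from the thread or from Mahout) takes the square root of the LLR and makes it positive when the phrase is relatively more frequent inside the cluster than outside, i.e. when k11/(k11+k12) > k21/(k21+k22).

# Signed square root of the LLR, on roughly the scale of a standard normal.
signedRootLLR <- function(k) {
  inRate  <- k[1, 1] / (k[1, 1] + k[1, 2])   # fraction of cluster docs with the phrase
  outRate <- k[2, 1] / (k[2, 1] + k[2, 2])   # fraction of other docs with the phrase
  s <- if (inRate > outRate) 1 else -1
  s * sqrt(llr(k))
}

signedRootLLR(k)  # about +1.72 for the 5-in / 100-out example above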