On Thu, Aug 6, 2009 at 2:57 AM, Ted Dunning<[email protected]> wrote:
>
> Generally, I find it better to use a more nuanced approach to try to get
> something like best common substring, or best over-represented sub-strings.
> A nice way to do that is the log-likelihood ratio test that I use for
> everything under the sun.  This would consider in-cluster and out-of-cluster
> as two classes and would consider the frequency of each possible term or
> phrase in these two classes.  This will give you words and phrases that are
> anomalously common in your cluster and relatively rare outside it.  You may
> want to use document frequency for these comparisons since you can often get
> those frequencies from, for example, a Lucene index more easily than the
> actual number of occurrences.
>

Hi Ted,

I'm having a little difficulty understanding how to apply LLR to cluster labels.

For a phrase, if
- in-cluster doc frequency is  inDF
- out-of-cluster doc frequency is  outDF
- size of the cluster is clusterSize
- size of the corpus is corpusSize

how do I calculate the LLR?

I have difficulty in mapping these numbers to Event A & Event B that
you talked about on your blog.

From the basic numbers, I could come up with an inCluster percentage,
but that doesn't help much. For example, let's say my cluster size is
1000 documents and the corpus size is 2000. A phrase which occurs in 5
documents inside the cluster and doesn't appear outside the cluster
has an inCluster percentage of 100. Another phrase occurs in 1000
documents inside the cluster and in 1000 documents outside the
cluster, so it has an inCluster percentage of 50. Intuitively, this
second phrase is a better candidate for a label than the first one,
but I am unable to figure out how these numbers need to be normalized.
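To make the question concrete, here is my best guess at the mapping, assuming the two events are "phrase present / absent" and "in cluster / out of cluster" in a 2x2 contingency table, with the LLR computed via the entropy formulation from your blog. Please correct me if the table is wrong:

```python
from math import log

def entropy(*counts):
    # Un-normalized entropy term: sum of k * ln(k / N) over non-zero counts.
    total = sum(counts)
    return sum(k * log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    # Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    # 2 * (H(matrix) - H(row sums) - H(column sums))
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (mat_entropy - row_entropy - col_entropy)

# My guessed mapping from the document frequencies:
#   k11 = inDF                                (phrase present, in cluster)
#   k12 = outDF                               (phrase present, out of cluster)
#   k21 = clusterSize - inDF                  (phrase absent,  in cluster)
#   k22 = (corpusSize - clusterSize) - outDF  (phrase absent,  out of cluster)
```

If I plug in my two example phrases with clusterSize = 1000 and corpusSize = 2000, the first one (`llr(5, 0, 995, 1000)`) comes out positive, while the second one (`llr(1000, 1000, 0, 0)`) comes out as exactly 0, i.e. no association at all. That seems to contradict my intuition above, so I suspect I am either misreading the table or misjudging which phrase makes the better label.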

It would be great if you could shed some light on this.

Thanks,

--shashi
