Great, thanks for the pointers Ted, will take a look when I'm back in the office tomorrow!
On 5 Aug 2009, at 22:27, Ted Dunning wrote:
LCS is often hilariously bad for various kinds of documents because it tends to pick up boiler-plate (my favorite example is "Staff writer of the Wall Street Journal" ... a 7 word phrase shared with a huge fraction of the documents I was working with at the time).
Generally, I find it better to use a more nuanced approach to try to get something like best common substring, or best over-represented substrings.
A nice way to do that is the log-likelihood ratio test that I use for everything under the sun. This would consider in-cluster and out-of-cluster as two classes and would consider the frequency of each possible term or phrase in these two classes. This will give you words and phrases that are anomalously common in your cluster and relatively rare outside it. You may want to use document frequency for these comparisons since you can often get those frequencies from, for example, a Lucene index more easily than the actual number of occurrences.
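A minimal sketch of that LLR test in Python, using the standard 2x2 contingency table over document frequencies (the counts in the example at the bottom are invented for illustration, not from any real corpus):

```python
from math import log

def xlogx(x):
    # x * ln(x), with the convention 0 * ln(0) = 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalised entropy: N ln N - sum(k ln k)
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: docs in the cluster containing the term
    k12: docs in the cluster without the term
    k21: docs outside the cluster containing the term
    k22: docs outside the cluster without the term
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# A term in 30 of 40 in-cluster docs but only 5 of 960 others scores high;
# a term appearing at the same rate inside and outside scores ~0.
print(llr(30, 10, 5, 955))
print(llr(20, 20, 480, 480))
```

Rank each candidate term or phrase by this score and the boiler-plate phrases, which occur at roughly the same rate everywhere, fall to the bottom of the list.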
See my blog about likelihood ratio tests <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>, my original paper on likelihood ratio tests <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>, or the section in Manning and Schuetze's book on collocations <http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false> for more details.
Another approach would be to apply an LLR test to find concepts at the term or phrase level that seem over-represented in the cluster. Random indexing or DCA or LDA would be reasonable ways to get conceptual level representations. This would let you handle cases where you don't have exact matches on terminology. Random indexing could be combined with simhashes to get very nice approximate conceptual matches.
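To make the simhash side of this concrete, here is a minimal token-level sketch; it hashes raw terms directly, so the random-indexing / concept-vector step is omitted, and the example phrases are made up:

```python
import hashlib

def simhash(tokens, bits=64):
    # Classic simhash: sum signed bit contributions of each token's hash,
    # then take the sign pattern of the sums as the fingerprint.
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(bits):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a, b):
    # Near-duplicate inputs produce fingerprints with small Hamming distance
    return bin(a ^ b).count("1")

a = simhash("clustering of small text documents".split())
b = simhash("clustering of short text documents".split())
c = simhash("completely unrelated phrase about cooking".split())
print(hamming(a, b), hamming(a, c))
```

Replacing the raw token hashes with random-index context vectors is what would lift this from surface matching to approximate conceptual matching.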
On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <[email protected]> wrote:
Hi,
As I've mentioned in the past, I'm working on clustering documents (albeit relatively small ones). The cluster mechanism I've ended up with has produced some pretty good results (at least for what I need to be able to do). However, what I'd like to be able to do is find a way to automate the naming of these groups.
For example, if each document has a 6/7 word title, I'd like to produce names that are somewhat logically ordered (that is, they make grammatical sense; this can probably be inferred from frequency in the clusters, since most documents in a cluster should be well-formed) and share terms across the majority of the titles.
So far, I'm using a kind of hacked-together longest common substring method:

* Sort the titles within the cluster
* Compare every string against every other string, producing an LCS value
* Use the most common LCS
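The steps above can be sketched roughly like this (the example titles are invented, and `difflib.SequenceMatcher` stands in for whatever LCS routine is actually in use):

```python
from collections import Counter
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # find_longest_match returns the longest contiguous matching block
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def name_cluster(titles):
    # Compare every title against every other, collect each pair's LCS,
    # and use the most common one as the cluster name.
    counts = Counter()
    for i, s in enumerate(titles):
        for t in titles[i + 1:]:
            lcs = longest_common_substring(s, t).strip()
            if lcs:
                counts[lcs] += 1
    return counts.most_common(1)[0][0] if counts else ""

titles = [
    "Introduction to machine learning basics",
    "Advanced machine learning basics explained",
    "A machine learning basics tutorial",
]
print(name_cluster(titles))  # -> "machine learning basics"
```

Note the pairwise comparison is O(n^2) in the cluster size, which is fine for small clusters but worth keeping in mind.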
As this is all relatively new ground for me, I was wondering whether there were any better methods I could be using?
Thanks,
Paul
--
Ted Dunning, CTO
DeepDyve