Great, thanks for the pointers Ted, will take a look when I'm back in the office tomorrow!

On 5 Aug 2009, at 22:27, Ted Dunning wrote:

LCS is often hilariously bad for various kinds of documents because it tends to pick up boiler-plate (my favorite example is "Staff writer of the Wall Street Journal" ... a 7-word phrase shared with a huge fraction of the documents I was working with at the time).

Generally, I find it better to use a more nuanced approach to try to get something like the best common substring, or the best over-represented substrings.
A nice way to do that is the log-likelihood ratio test that I use for everything under the sun. This would treat in-cluster and out-of-cluster as two classes and consider the frequency of each possible term or phrase in these two classes. That will give you words and phrases that are anomalously common in your cluster and relatively rare outside it. You may want to use document frequency for these comparisons, since you can often get those frequencies more easily (from, for example, a Lucene index) than the actual number of occurrences.
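To make that concrete, here is a minimal Python sketch of the entropy form of the test on document-frequency counts (the `llr_2x2` name and the count labels are mine, not from Mahout or Lucene):

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: in-cluster docs containing the term
    k12: in-cluster docs without the term
    k21: out-of-cluster docs containing the term
    k22: out-of-cluster docs without the term
    """
    def h(*counts):
        # Unnormalized entropy term: sum of k * ln(k / N), skipping zeros.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    return 2 * (h(k11, k12, k21, k22)       # H of the full table
                - h(k11 + k12, k21 + k22)   # H of the row sums
                - h(k11 + k21, k12 + k22))  # H of the column sums
```

Scoring every candidate term this way and labeling the cluster with the highest-scoring words or phrases is the idea; independent frequencies score near zero, strongly associated ones score high.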

See my blog post about likelihood ratio tests <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>, my original paper on likelihood ratio tests <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>, or the section on collocations in Manning and Schuetze's book <http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false> for more details.

Another approach would be to apply an LLR test to find concepts at the term or phrase level that seem over-represented in the cluster. Random indexing or DCA or LDA would be reasonable ways to get conceptual-level representations. This would let you handle cases where you don't have exact matches on terminology. Random indexing could be combined with sim-hashes to get very nice approximate conceptual matches.
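As a rough illustration of the sim-hash half of that idea, here is a minimal sketch (the `simhash` and `hamming` helpers are hypothetical names; a real system would hash the dense random-indexing vectors rather than raw tokens, as suggested above):

```python
import hashlib

def simhash(tokens, bits=64):
    """Minimal sim-hash: sum per-token hash bits, keep the sign of each sum."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits; a small distance means similar inputs."""
    return bin(a ^ b).count("1")
```

Inputs that share most of their tokens (or, better, most of their concept-vector components) tend to land within a few bits of each other, so near-matches can be found with cheap bitwise comparisons.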

On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <[email protected]> wrote:

Hi,

As I've mentioned in the past, I'm working on clustering documents (albeit
relatively small ones). The cluster mechanism I've ended up with has
produced some pretty good results (at least for what I need to be able to do). However, what I'd like to be able to do is find a way to automate the
naming of these groups.

For example, if each document has a 6/7-word title, I'd like to produce names that are somewhat logically ordered (that is, they make grammatical sense; this can probably be inferred from frequency within the clusters, since most titles in a cluster should be well-formed) and that share terms across the majority of the titles.

So far, I'm using a kind of hacked-together longest common substring
method:

* Sort the titles within the cluster
* Compare every string against every other string, producing an LCS value
* Use the most common LCS
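That hacked-together procedure might be sketched like this (a rough illustration; the function names are mine, and I'm treating titles as token lists so the common substring is word-level rather than character-level):

```python
from collections import Counter
from itertools import combinations

def longest_common_substring(a, b):
    """Longest common run of tokens shared by lists a and b, via DP."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return tuple(a[best_end - best:best_end])

def name_cluster(titles):
    """Sort titles, take the LCS of every pair, return the most common one."""
    counts = Counter()
    for t1, t2 in combinations(sorted(titles), 2):
        lcs = longest_common_substring(t1.split(), t2.split())
        if lcs:
            counts[lcs] += 1
    return " ".join(counts.most_common(1)[0][0]) if counts else ""
```

With titles like "intro to machine learning", "advanced machine learning", and "machine learning basics", every pair shares "machine learning", so that becomes the cluster name.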

As this is all relatively new ground for me, I was wondering whether there
were any better methods I could be using?

Thanks,
Paul




--
Ted Dunning, CTO
DeepDyve
