LCS is often hilariously bad for various kinds of documents because it tends to pick up boilerplate (my favorite example is "Staff writer of the Wall Street Journal", a seven-word phrase shared by a huge fraction of the documents I was working with at the time).
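This failure mode is easy to reproduce. Here is a minimal sketch using Python's `difflib` to take the word-level longest common substring of two made-up titles (the titles and the trailing byline are illustrative, not from any real corpus); the shared boilerplate dominates the result:

```python
from difflib import SequenceMatcher

def longest_common_substring(a_words, b_words):
    # Longest contiguous run of words shared by the two titles.
    m = SequenceMatcher(None, a_words, b_words).find_longest_match(
        0, len(a_words), 0, len(b_words))
    return a_words[m.a:m.a + m.size]

# Hypothetical titles, each carrying the same boilerplate suffix.
titles = [
    "Markets Rally as Fed Holds Rates -- Staff Writer of the Wall Street Journal".split(),
    "Oil Prices Slide on Supply Fears -- Staff Writer of the Wall Street Journal".split(),
]

# The boilerplate byline wins; the actual topic words never appear.
print(" ".join(longest_common_substring(titles[0], titles[1])))
```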
Generally, I find it better to use a more nuanced approach that gets at something like the best common substring, or the best over-represented substrings. A nice way to do that is the log-likelihood ratio test that I use for everything under the sun. This would treat in-cluster and out-of-cluster as two classes and compare the frequency of each candidate term or phrase across those two classes. That gives you words and phrases that are anomalously common in your cluster and relatively rare outside it. You may want to use document frequency for these comparisons, since you can often get those frequencies more easily (from a Lucene index, for example) than the actual number of occurrences.

See my blog post about likelihood ratio tests <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>, my original paper on likelihood ratio tests <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>, or the section on collocations in Manning and Schuetze's book <http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false> for more details.

Another approach would be to apply an LLR test at the concept level rather than the term or phrase level, looking for concepts that seem over-represented in the cluster. Random indexing, DCA, or LDA would be reasonable ways to get concept-level representations. This would let you handle cases where you don't have exact matches on terminology. Random indexing could be combined with sim-hashes to get very nice approximate conceptual matches.

On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <[email protected]> wrote:

> Hi,
>
> As I've mentioned in the past, I'm working on clustering documents (albeit
> relatively small ones). The clustering mechanism I've ended up with has
> produced some pretty good results (at least for what I need to be able to
> do). However, what I'd like to be able to do is find a way to automate the
> naming of these groups.
>
> For example, if each document has a 6/7 word title, I'd like to produce
> names that are somewhat logically ordered (that is, they make grammatical
> sense; this can probably be inferred from the frequency in the clusters:
> most documents in a cluster should be well formed) and share terms across
> the majority of the titles.
>
> So far, I'm using a kind of hacked-together longest common substring
> method:
>
> * Sort the titles within the cluster
> * Compare every string against every other string, producing an LCS value
> * Use the most common LCS
>
> As this is all relatively new ground for me, I was wondering whether there
> were any better methods I could be using?
>
> Thanks,
> Paul

-- 
Ted Dunning, CTO
DeepDyve
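The log-likelihood ratio test described in the reply above can be sketched compactly. This is a minimal, assumption-laden version using the standard G² statistic on a 2x2 contingency table of document frequencies (the variable names `k11`–`k22` and the example counts are illustrative, not from the thread):

```python
import math

def xlogx(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts,
    # expressed as xlogx(total) - sum of xlogx(parts).
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11 = docs in the cluster that contain the term
    k12 = docs outside the cluster that contain the term
    k21 = docs in the cluster without the term
    k22 = docs outside the cluster without the term
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# A term at the same 10% rate inside and outside the cluster scores ~0;
# a term in 10% of cluster docs but ~1% of the rest scores high.
print(llr(10, 90, 90, 810))   # evenly distributed
print(llr(10, 10, 90, 890))   # over-represented in the cluster
```

Scoring every candidate term or phrase this way and keeping the top few gives the "anomalously common in-cluster" words that the reply suggests using for cluster names.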
