LCS is often hilariously bad for various kinds of documents because it tends
to pick up boilerplate (my favorite example is "Staff writer of the Wall
Street Journal" ... a seven-word phrase shared by a huge fraction of the
documents I was working with at the time).

Generally, I find it better to use a more nuanced approach and aim for
something like the best common substring, or the most over-represented
substrings.  A nice way to do that is the log-likelihood ratio test that I
use for everything under the sun.  This treats in-cluster and out-of-cluster
documents as two classes and compares the frequency of each candidate term
or phrase between the two.  That will give you words and phrases that are
anomalously common in your cluster and relatively rare outside it.  You may
want to use document frequency for these comparisons, since you can often
get those frequencies more easily from, for example, a Lucene index than the
actual number of occurrences.
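
To make that concrete, here is a minimal sketch in Python of the 2x2
log-likelihood ratio (G^2) computation.  The counts are document
frequencies, and all of the function and variable names are my own
invention for illustration, not from any particular library:

```python
import math

def _sum_log(counts):
    # Sum of k * ln(k / total) over the counts; 0 * ln(0) is taken as 0.
    total = sum(counts)
    return sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: in-cluster documents containing the term
    k12: out-of-cluster documents containing the term
    k21: in-cluster documents without the term
    k22: out-of-cluster documents without the term
    """
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    return 2.0 * (_sum_log((k11, k12, k21, k22))
                  - _sum_log(row_sums)
                  - _sum_log(col_sums))
```

Score every candidate term or phrase this way and the ones with the highest
scores are your label candidates.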

See my blog post about likelihood ratio tests
<http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>, my
original paper on likelihood ratio tests
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>, or the
section on collocations in Manning and Schuetze's book
<http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false>
for more details.

Another approach would be to apply the same LLR test at the concept level
rather than on literal terms or phrases, to find concepts that seem
over-represented in the cluster.  Random indexing or DCA or LDA would be
reasonable ways to get concept-level representations.  This would let you
handle cases where you don't have exact matches on terminology.  Random
indexing could be combined with sim-hashes to get very nice approximate
conceptual matches.
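
As a rough illustration of the random-indexing plus sim-hash idea, here is
a toy Python sketch.  The 64-bit width and all of the names are arbitrary
choices of mine; a real system would use sparse ternary index vectors and a
proper hash family:

```python
import random

BITS = 64  # sim-hash width; an arbitrary choice for this sketch

def term_vector(term, dim=BITS):
    # Deterministic pseudo-random +/-1 vector for a term (random-indexing style).
    rng = random.Random(term)  # seeding with the term keeps the vector stable
    return [rng.choice((-1, 1)) for _ in range(dim)]

def simhash(terms, dim=BITS):
    # Sum the term vectors, then keep only the sign of each component as a bit.
    acc = [0] * dim
    for term in terms:
        for i, v in enumerate(term_vector(term, dim)):
            acc[i] += v
    return sum(1 << i for i, total in enumerate(acc) if total > 0)

def hamming(a, b):
    # Hamming distance between two hashes; small distance ~ similar term bags.
    return bin(a ^ b).count("1")
```

Documents (or clusters) that share most of their terms tend to land within
a small Hamming distance of each other, so near-duplicate concepts can be
found without exact terminology matches.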

On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <[email protected]> wrote:

> Hi,
>
> As I've mentioned in the past, I'm working on clustering documents (albeit
> relatively small ones). The cluster mechanism I've ended up with has
> produced some pretty good results (at least for what I need to be able to
> do). However, what I'd like to be able to do is find a way to automate the
> naming of these groups.
>
> For example, if each document has a 6/7 word title, I'd like to produce
> names that are somewhat logically ordered (that is, they make grammatical
> sense; this can probably be inferred from frequency in the clusters: most
> documents in a cluster should be well-formed) and share terms across the
> majority of the titles.
>
> So far, I'm using a kind of hacked-together longest common substring
> method:
>
> * Sort the titles within the cluster
> * Compare every string against every other string, producing an LCS value
> * Use the most common LCS
>
> As this is all relatively new ground for me, I was wondering whether there
> were any better methods I could be using?
>
> Thanks,
> Paul
>



-- 
Ted Dunning, CTO
DeepDyve
