Great, thanks for the pointers Ted, will take a look when I'm back in the office tomorrow!
On 5 Aug 2009, at 22:27, Ted Dunning wrote:
LCS is often hilariously bad for various kinds of documents because it tends to pick up boiler-plate (my favorite example is "Staff writer of the Wall Street Journal" ... a 7 word phrase shared with a huge fraction of the documents I was working with at the time).
Generally, I find it better to use a more nuanced approach to try to get something like best common substring, or best over-represented substrings.
A nice way to do that is the log-likelihood ratio test that I use for everything under the sun. This would consider in-cluster and out-of-cluster as two classes and would consider the frequency of each possible term or phrase in these two classes. This will give you words and phrases that are anomalously common in your cluster and relatively rare outside it. You may want to use document frequency for these comparisons since you can often get those frequencies from, for example, a Lucene index more easily than the actual number of occurrences.
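A minimal sketch of that LLR test in Python, using the standard 2x2 contingency table over document frequencies (the counts in the example at the bottom are invented for illustration, not from any real corpus):

```python
from math import log

def xlogx(x):
    # x * ln(x), with the convention 0 * ln(0) = 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalised entropy: N ln N - sum(k ln k)
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: docs in the cluster containing the term
    k12: docs in the cluster without the term
    k21: docs outside the cluster containing the term
    k22: docs outside the cluster without the term
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# A term in 30 of 40 in-cluster docs but only 5 of 960 others scores high;
# a term appearing at the same rate inside and outside scores ~0.
print(llr(30, 10, 5, 955))
print(llr(20, 20, 480, 480))
```

Rank each candidate term or phrase by this score and the boiler-plate phrases, which occur at roughly the same rate everywhere, fall to the bottom of the list.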
See my blog about likelihood ratio tests <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>, my original paper on likelihood ratio tests <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>, or the section in Manning and Schuetze's book on collocations <http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false> for more details.
Another approach would be to apply an LLR test to find concepts at the term or phrase level that seem over-represented in the cluster. Random indexing or DCA or LDA would be reasonable ways to get conceptual level representations. This would let you handle cases where you don't have exact matches on terminology. Random indexing could be combined with simhashes to get very nice approximate conceptual matches.
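To make the simhash side of this concrete, here is a minimal token-level sketch; it hashes raw terms directly, so the random-indexing / concept-vector step is omitted, and the example phrases are made up:

```python
import hashlib

def simhash(tokens, bits=64):
    # Classic simhash: sum signed bit contributions of each token's hash,
    # then take the sign pattern of the sums as the fingerprint.
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(bits):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a, b):
    # Near-duplicate inputs produce fingerprints with small Hamming distance
    return bin(a ^ b).count("1")

a = simhash("clustering of small text documents".split())
b = simhash("clustering of short text documents".split())
c = simhash("completely unrelated phrase about cooking".split())
print(hamming(a, b), hamming(a, c))
```

Replacing the raw token hashes with random-index context vectors is what would lift this from surface matching to approximate conceptual matching.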
On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <[email protected]> wrote:
Hi,
As I've mentioned in the past, I'm working on clustering documents (albeit relatively small ones). The cluster mechanism I've ended up with has produced some pretty good results (at least for what I need to be able to do). However, what I'd like to be able to do is find a way to automate the naming of these groups.
For example, if each document has a 6/7 word title, I'd like to produce names that are somewhat logically ordered (that is, they make grammatical sense; this can probably be inferred from frequency in the clusters, since most documents in a cluster should be well-formed) and share terms across the majority of the titles.
So far, I'm using a kind of hacked-together longest common substring method:

* Sort the titles within the cluster
* Compare every string against every other string, producing an LCS value
* Use the most common LCS
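The steps above can be sketched roughly like this (the example titles are invented, and `difflib.SequenceMatcher` stands in for whatever LCS routine is actually in use):

```python
from collections import Counter
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # find_longest_match returns the longest contiguous matching block
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def name_cluster(titles):
    # Compare every title against every other, collect each pair's LCS,
    # and use the most common one as the cluster name.
    counts = Counter()
    for i, s in enumerate(titles):
        for t in titles[i + 1:]:
            lcs = longest_common_substring(s, t).strip()
            if lcs:
                counts[lcs] += 1
    return counts.most_common(1)[0][0] if counts else ""

titles = [
    "Introduction to machine learning basics",
    "Advanced machine learning basics explained",
    "A machine learning basics tutorial",
]
print(name_cluster(titles))  # -> "machine learning basics"
```

Note the pairwise comparison is O(n^2) in the cluster size, which is fine for small clusters but worth keeping in mind.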
As this is all relatively new ground for me, I was wondering whether there were any better methods I could be using?
Thanks,
Paul
--
Ted Dunning, CTO
DeepDyve