On Sat, Sep 5, 2009 at 1:58 PM, Sebastien Bratieres <[email protected]> wrote:
> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
> upon Ted's log likelihood ratio.

Yes. And they do a good job of it. Their backoff model is essentially identical to one that I proposed in my dissertation, but the permutation test is a good addition for deciding which LRs are useful for finding new n-grams for the model. The permutation test that they propose is also very similar to the one that I used for analyzing genetic sequence data with LRs, although in a very different context.

> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it brings
> over LLR.

That really depends on many factors. The primary factor is how much data you have to work with. Scale has two effects: first, it makes fancy techniques much harder to apply; second, it makes simpler techniques work better and better.

LLRs are very hard to beat in terms of complexity. In general, I will pretty much always opt for LLR computed independently for all collocations simultaneously. This definitely has the defects described in the paper, but you can do the computation for all n-gram extensions with at worst a single pass over the data. That means that you need at most a few passes (typically 7 or fewer). There are also clever data structures that can be built pretty economically that let you do all the computations in a single pass. These clever structures, however, are difficult to scale. If your data is moderate in size (say, Wikipedia), passes over the data only take a few minutes. For large clusters and web-scale corpora, you can pass over the data in a few hours. (A rough sketch of the basic LLR computation is at the end of this message.)

So to start, I just use the simplest possible implementation and evaluate the results with users. If the results are good enough, then I am done.

The next step is generally to use LR mostly as described in this paper. The only change is that I would recommend using a threshold to incorporate large numbers of n-grams at a time into the model. This minimizes the number of passes over the data, but the overall process will definitely take longer. It will likely pull in higher-quality phrases than the simpler model at some computational cost.

Finally, I would consider more elaborate processes such as the one Lafferty and Blei suggest. The real problem here is that they don't specify quite what they do. They give a constraint (permute, preserving known n-gram dependencies), but not an exact procedure. As such, it will be hard to replicate the process without asking a few questions.

In practice, though, I would spend much more effort on good topic modeling. In my experience, decent phrase discovery is not discernibly worse than great phrase discovery, at least viewed on a scale determined by the difference between good topic discovery and bad topic discovery. Since I am usually doing this work at a startup, development time is massively limited and best-bang-for-the-buck is the only feasible scheduling algorithm.

The next step for Mahout to explore any of this is to build a good co-occurrence counter (see the second sketch at the end of this message).
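Since the question of implementation effort keeps coming up, here is roughly what I mean by the basic LLR computation: the standard G^2 score on a 2x2 contingency table of counts. This is just an illustrative, untested sketch written for this message (the class name Llr is arbitrary), not code that ships anywhere, but it shows how little code the core score actually needs:

// Rough sketch of the 2x2 log-likelihood ratio (G^2) computation.
// Class and method names are arbitrary; this is illustration only.
public final class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // "Entropy" here is the un-normalized form: xLogX(sum) minus the sum
  // of xLogX over the parts.
  private static double entropy(long... counts) {
    long total = 0;
    double partSum = 0.0;
    for (long k : counts) {
      total += k;
      partSum += xLogX(k);
    }
    return xLogX(total) - partSum;
  }

  // k11: A and B together; k12: A without B; k21: B without A; k22: neither.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Guard against tiny negative values caused by floating point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }
}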
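And for the co-occurrence counter, the simplest possible thing is a single pass over the token stream that accumulates unigram-position and adjacent-bigram counts, then scores candidate pairs from the resulting 2x2 table. The sketch below is a toy, purely in-memory version using the Llr sketch above; the names are made up, and a real Mahout version would obviously need to be a map-reduce job rather than a pair of HashMaps:

import java.util.HashMap;
import java.util.Map;

// Toy in-memory co-occurrence counter: one pass collects the counts
// needed to fill the 2x2 table for any adjacent word pair.
public class CooccurrenceCounter {
  private final Map<String, Long> leftCounts = new HashMap<String, Long>();   // token as left half of a bigram
  private final Map<String, Long> rightCounts = new HashMap<String, Long>();  // token as right half of a bigram
  private final Map<String, Long> bigramCounts = new HashMap<String, Long>();
  private long totalBigrams = 0;

  public void addDocument(String[] tokens) {
    for (int i = 0; i + 1 < tokens.length; i++) {
      increment(leftCounts, tokens[i]);
      increment(rightCounts, tokens[i + 1]);
      increment(bigramCounts, tokens[i] + ' ' + tokens[i + 1]);
      totalBigrams++;
    }
  }

  // LLR score for the candidate phrase "a b" from the accumulated counts.
  public double score(String a, String b) {
    long k11 = count(bigramCounts, a + ' ' + b);   // a followed by b
    long k12 = count(leftCounts, a) - k11;         // a followed by something else
    long k21 = count(rightCounts, b) - k11;        // something else followed by b
    long k22 = totalBigrams - k11 - k12 - k21;     // all other bigrams
    return Llr.logLikelihoodRatio(k11, k12, k21, k22);
  }

  private static void increment(Map<String, Long> counts, String key) {
    Long old = counts.get(key);
    counts.put(key, old == null ? 1L : old + 1L);
  }

  private static long count(Map<String, Long> counts, String key) {
    Long value = counts.get(key);
    return value == null ? 0L : value;
  }
}

In practice you would just keep every pair whose score clears some threshold, which is the "incorporate large numbers of n-grams at a time" step I mentioned above.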
