On Sat, Sep 5, 2009 at 1:58 PM, Sebastien Bratieres <[email protected]> wrote:
> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
> upon Ted's log likelihood ratio.

Yes. And they do a good job of it. Their backoff model is essentially identical to one that I proposed in my dissertation, but the permutation test is a good addition for deciding which LRs are useful for finding new n-grams for the model. The permutation test that they propose is also very similar to the one that I used for analyzing genetic sequence data with LRs, although in a very different context.

> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it brings
> over LLR.

That really depends on many factors. The primary factor is how much data you have to work with. Scale has two effects: first, it makes fancy techniques much harder to apply; second, it makes simpler techniques work better and better.

LLRs are very hard to beat in terms of complexity. In general, I will pretty much always opt for LLR computed independently for all collocations simultaneously. This definitely has the defects described in the paper, but you can do the computation for all n-gram extensions with at worst a single pass over the data. That means that you need at most a few passes (typically 7 or fewer). There are also clever data structures that can be built pretty economically that let you do all the computations in a single pass. These clever structures, however, are difficult to scale. If your data is moderate in size (say, Wikipedia), passes over the data only take a few minutes. For large clusters and web-scale corpora, you can pass over the data in a few hours. (A rough sketch of the basic LLR computation is at the end of this message.)

So to start, I just use the simplest possible implementation and evaluate the results with users. If the results are good enough, then I am done.

The next step is generally to use LR mostly as described in this paper. The only change is that I would recommend using a threshold to incorporate large numbers of n-grams at a time into the model. This minimizes the number of passes over the data, but the overall process will definitely take longer. It will likely pull in higher-quality phrases than the simpler model at some computational cost.

Finally, I would consider more elaborate processes such as the one Lafferty and Blei suggest. The real problem here is that they don't specify quite what they do. They give a constraint (permute, preserving known n-gram dependencies), but not an exact procedure. As such, it will be hard to replicate the process without asking a few questions.

In practice, though, I would spend much more effort on good topic modeling. In my experience, decent phrase discovery is not discernibly worse than great phrase discovery, at least viewed on a scale determined by the difference between good topic discovery and bad topic discovery. Since I am usually doing this work at a startup, development time is massively limited and best-bang-for-the-buck is the only feasible scheduling algorithm.

The next step for Mahout to explore any of this is to build a good co-occurrence counter (see the second sketch at the end of this message).
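Since the question of implementation effort keeps coming up, here is roughly what I mean by the basic LLR computation: the standard G^2 score on a 2x2 contingency table of counts. This is just an illustrative, untested sketch written for this message (the class name Llr is arbitrary), not code that ships anywhere, but it shows how little code the core score actually needs:

// Rough sketch of the 2x2 log-likelihood ratio (G^2) computation.
// Class and method names are arbitrary; this is illustration only.
public final class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // "Entropy" here is the un-normalized form: xLogX(sum) minus the sum
  // of xLogX over the parts.
  private static double entropy(long... counts) {
    long total = 0;
    double partSum = 0.0;
    for (long k : counts) {
      total += k;
      partSum += xLogX(k);
    }
    return xLogX(total) - partSum;
  }

  // k11: A and B together; k12: A without B; k21: B without A; k22: neither.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Guard against tiny negative values caused by floating point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }
}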
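And for the co-occurrence counter, the simplest possible thing is a single pass over the token stream that accumulates unigram-position and adjacent-bigram counts, then scores candidate pairs from the resulting 2x2 table. The sketch below is a toy, purely in-memory version using the Llr sketch above; the names are made up, and a real Mahout version would obviously need to be a map-reduce job rather than a pair of HashMaps:

import java.util.HashMap;
import java.util.Map;

// Toy in-memory co-occurrence counter: one pass collects the counts
// needed to fill the 2x2 table for any adjacent word pair.
public class CooccurrenceCounter {
  private final Map<String, Long> leftCounts = new HashMap<String, Long>();   // token as left half of a bigram
  private final Map<String, Long> rightCounts = new HashMap<String, Long>();  // token as right half of a bigram
  private final Map<String, Long> bigramCounts = new HashMap<String, Long>();
  private long totalBigrams = 0;

  public void addDocument(String[] tokens) {
    for (int i = 0; i + 1 < tokens.length; i++) {
      increment(leftCounts, tokens[i]);
      increment(rightCounts, tokens[i + 1]);
      increment(bigramCounts, tokens[i] + ' ' + tokens[i + 1]);
      totalBigrams++;
    }
  }

  // LLR score for the candidate phrase "a b" from the accumulated counts.
  public double score(String a, String b) {
    long k11 = count(bigramCounts, a + ' ' + b);   // a followed by b
    long k12 = count(leftCounts, a) - k11;         // a followed by something else
    long k21 = count(rightCounts, b) - k11;        // something else followed by b
    long k22 = totalBigrams - k11 - k12 - k21;     // all other bigrams
    return Llr.logLikelihoodRatio(k11, k12, k21, k22);
  }

  private static void increment(Map<String, Long> counts, String key) {
    Long old = counts.get(key);
    counts.put(key, old == null ? 1L : old + 1L);
  }

  private static long count(Map<String, Long> counts, String key) {
    Long value = counts.get(key);
    return value == null ? 0L : value;
  }
}

In practice you would just keep every pair whose score clears some threshold, which is the "incorporate large numbers of n-grams at a time" step I mentioned above.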
