Sure. Happy to send the dissertation, but there will be a short delay. I am traveling and my laptop got stolen, so it will be a few days before I get a new machine and reload from backups.
The short answer, however, is that LLR as produced by Mahout should be plenty for your problem. It should be pretty easy to produce a list of interesting keywords for each subject code, and these are reasonably likely to do a good job of retrieving documents.

I would add one step of automated relevance feedback: search for documents using the first set of keywords for a particular subject code, then compare the top 20 or so documents that fall in the subject code against the top 20 or so that do not, and extract the terms that distinguish the two groups. This will give a more focused set of keywords that are likely to perform more accurately than the first set. I would keep both sets separate so that you can use either one at will. (A rough sketch of the LLR scoring itself is below, after your quoted message.)

On Fri, Jul 29, 2011 at 4:26 AM, Dan Brickley <[email protected]> wrote:

> Hi Ted,
>
> In http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
> you mention your dissertation.
>
> If that's ok, could I have a copy too?
>
> Context: I have a couple of pretty large (3M, 12M) bibliographic
> datasets, both of which have fairly cryptic subject codes, plus short
> textual titles applied to a lot of books. We're trying to match these
> subject codes with other collections that are only described with a
> few simple web2-style tags, so the hope is to see whether the topics
> can be augmented with indicative keywords (and maybe later phrases)
> extracted from document titles. On the technical side, Lucene/Solr is
> already being used, so ideally I'd find a way to apply Mahout's
> LogLikelihood to term vectors imported from Lucene indices. On the
> theory side I'm wobblier than I'd like to be, so thought a look at
> your phd might do me some good...
> ...
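To make the LLR step concrete, here is a rough, self-contained Java sketch of the score described in the blog post you linked: the G^2 statistic over a 2x2 contingency table of term counts. The class name and the counts in main() are made up purely for illustration, and Mahout ships an equivalent (in org.apache.mahout.math.stats.LogLikelihood, if I remember the package right), so in practice you would call that rather than roll your own.

// A minimal sketch of LLR keyword scoring, following the 2x2 formulation
// from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
// All names and counts here are hypothetical; in practice, pull the term
// frequencies from your Lucene term vectors.
public class LlrKeywordSketch {
    // Sum of (k/N) * log(k/N) over the counts, with 0 log 0 taken as 0.
    private static double h(long... counts) {
        long n = 0;
        for (long k : counts) {
            n += k;
        }
        double sum = 0;
        for (long k : counts) {
            if (k > 0) {
                double p = (double) k / n;
                sum += p * Math.log(p);
            }
        }
        return sum;
    }

    // G^2 = 2 * N * (H(cells) - H(row sums) - H(column sums)), where
    //   k11 = occurrences of the term in docs with the subject code
    //   k12 = occurrences of the term in docs without the code
    //   k21 = occurrences of all other terms in docs with the code
    //   k22 = occurrences of all other terms in docs without the code
    public static double llr(long k11, long k12, long k21, long k22) {
        long n = k11 + k12 + k21 + k22;
        return 2 * n * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    // row sums
                - h(k11 + k21, k12 + k22));  // column sums
    }

    public static void main(String[] args) {
        // Hypothetical counts: a term appears 110 times in titles carrying
        // some subject code and 30 times elsewhere, out of ~100k tokens.
        // A large score means the term is strongly associated with the code.
        System.out.println(llr(110, 30, 10000, 90000));
    }
}

For the relevance feedback step, the same scoring applies unchanged; you just recompute k11 and k12 from term counts over the top 20 or so hits in and out of the subject code, rather than over the whole collection.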
