Sure. Happy to send the dissertation, but there will be a short delay. I
am traveling and my laptop was stolen, so it will be a few days before I
get a new machine and reload from backups.

The short answer, however, is that LLR as produced by Mahout should be
plenty for your problem. It should be pretty easy to produce a list of
interesting keywords for each subject code, and those keywords are
reasonably likely to do a good job of retrieving documents.

I would add one step of automated relevance feedback: search for documents
using the first set of keywords for a particular subject code, then extract
key terms a second time, comparing the top 20 or so documents that carry
the subject code against the top 20 or so that don't. This will give a more
focused set of keywords that is likely to perform more accurately than the
first set. I would keep both sets separately so that you can use either one
at will.
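A rough sketch of that feedback pass against the Lucene 3.x-era API; note
that the field names "title" and "subject_code", the pool size of 20, and
the tally() helper are assumptions about your index and pipeline, not
anything prescribed by Lucene or Mahout:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.util.Version;
    import org.apache.mahout.math.stats.LogLikelihood;

    public class FeedbackPass {
      public static void refine(IndexSearcher searcher, String subjectCode,
                                String seedKeywords) throws Exception {
        Query q = new QueryParser(Version.LUCENE_33, "title",
            new StandardAnalyzer(Version.LUCENE_33)).parse(seedKeywords);
        ScoreDoc[] hits = searcher.search(q, 200).scoreDocs;

        // Split the top hits into ~20 documents with the code and ~20 without.
        List<Document> in = new ArrayList<Document>();
        List<Document> out = new ArrayList<Document>();
        for (ScoreDoc sd : hits) {
          Document d = searcher.doc(sd.doc);
          if (subjectCode.equals(d.get("subject_code"))) {
            if (in.size() < 20) in.add(d);
          } else if (out.size() < 20) {
            out.add(d);
          }
        }

        // Tally term counts in each pool, then re-score every term with LLR.
        Map<String, long[]> counts = new HashMap<String, long[]>();
        long inTotal = tally(in, counts, 0);
        long outTotal = tally(out, counts, 1);
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
          long k11 = e.getValue()[0];
          long k12 = e.getValue()[1];
          double score = LogLikelihood.logLikelihoodRatio(
              k11, k12, inTotal - k11, outTotal - k12);
          System.out.println(e.getKey() + "\t" + score);
        }
      }

      // Crude whitespace tokenizer over stored titles; slot 0 = in-code, 1 = out.
      private static long tally(List<Document> docs, Map<String, long[]> counts,
                                int slot) {
        long total = 0;
        for (Document d : docs) {
          for (String tok : d.get("title").toLowerCase().split("\\W+")) {
            if (tok.isEmpty()) continue;
            long[] c = counts.get(tok);
            if (c == null) counts.put(tok, c = new long[2]);
            c[slot]++;
            total++;
          }
        }
        return total;
      }
    }

The terms that score highest in this second pass form the refined keyword
set; keep the seed set around as well, since it is cheaper to recompute and
useful as a fallback.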

On Fri, Jul 29, 2011 at 4:26 AM, Dan Brickley <[email protected]> wrote:

> Hi Ted,
>
> In http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
> you mention your dissertation.
>
> If that's ok, could I have a copy too?
>
> Context: I have a couple of pretty large (3M, 12M) bibliographic
> datasets, both of which have fairly cryptic subject codes, plus short
> textual titles applied to a lot of books. We're trying to match these
> subject codes with other collections that are only described with a
> few simple web2-style tags, so the hope is to see whether the topics
> can be augmented with indicative keywords (and maybe later phrases)
> extracted from document titles. On the technical side, Lucene/Solr is
> already being used so ideally, I'd find a way to apply Mahout's
> LogLikelihood to term vectors imported from Lucene indices. On the
> theory side I'm wobblier than I'd like to be, so I thought a look at
> your PhD might do me some good...
> ...
