I've done this by comparing term frequency in a subset (in Amazon's case, a single book) and looking for a significant "uplift" in a term's popularity versus its popularity in the general corpus. Practically speaking, in the Amazon case you can treat each page of the example book as a Lucene document, index the pages into a RAMDirectory, and then use its TermEnum to get the docFreqs for all words and compare them with the corpus docFreqs.
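The indexing step would look something like this (an untested sketch against the older TermEnum-era Lucene API; the "text" field name and the indexPages signature are my placeholders, not anything from the actual code):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class SubsetIndexer {

    // Treat each page of the book as its own Lucene document, so that
    // docFreq means "number of pages containing the term".
    public static RAMDirectory indexPages(String[] pages) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (int i = 0; i < pages.length; i++) {
            Document doc = new Document();
            // "text" is a placeholder field name; it must match the field
            // used in the corpus index for the docFreqs to be comparable.
            doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }
}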
The "uplift" score for each term is (subsetDocFreq/subsetNumDocs)-(corpusDocFreq/corpusNumDocs) Take the top "n" terms scored by the above then analyze the text of the subset looking for runs of these terms. I have some code for this that I have wanted to package up as a contribution for some time. ___________________________________________________________ Yahoo! Messenger - NEW crystal clear PC to PC calling worldwide with voicemail http://uk.messenger.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]