Hi Mike,

If you just need the IDF, you can run HighFreqTerms.java in contrib against either your sample index or your full index to get the N terms with the highest DF values (i.e., the lowest IDF). If you have a large index, giving it plenty of memory seems to help.
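To make the DF/IDF relationship concrete, here is a minimal plain-Java sketch that does not touch a Lucene index at all. It counts document frequencies over a few in-memory strings, ranks terms by DF, and converts DF to IDF using the classic Lucene formula, 1 + ln(numDocs / (docFreq + 1)). The class name, method names, whitespace-style tokenization, and sample documents are all made up for illustration; HighFreqTerms does the equivalent ranking against the index itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: illustrates DF ranking and IDF only.
final class StopwordCandidates {

    /** Counts, for each term, the number of documents that contain it (DF). */
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // A set ensures a term is counted once per document, however often it occurs.
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty() && seen.add(token)) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** Returns up to n terms with the highest DF (lowest IDF); ties broken alphabetically. */
    static List<String> topByDocFreq(Map<String, Integer> df, int n) {
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        // Made-up sample documents standing in for the 10% sample index.
        List<String> docs = Arrays.asList(
                "the cat sat on the mat",
                "the dog chased the cat",
                "a bird sang in the tree");
        Map<String, Integer> df = documentFrequencies(docs);
        for (String term : topByDocFreq(df, 5)) {
            System.out.printf("%-8s df=%d idf=%.4f%n",
                    term, df.get(term), idf(df.get(term), docs.size()));
        }
    }
}
```

Because IDF falls monotonically as DF rises, sorting by highest DF and sorting by lowest IDF give the same ordering, so the DF ranking alone is enough to pick stopword candidates.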
Depending on your use case, you may instead want to run it with the "-t" flag, which gets the terms with the highest total number of occurrences (total tf). That is a good measure of the size of the positions list for those terms. The size of the positions list only matters if you allow phrase or proximity queries.

See:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=markup

Regarding the positions list and slow phrase queries, see:
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

You can also look at the standard stop word sets at
http://snowball.tartarus.org/ (look under the entries for each stemmer)
or
http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/
or
http://members.unine.ch/jacques.savoy/clef/index.html

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: Mike O'Leary [mailto:tmole...@uw.edu]
Sent: Thursday, December 15, 2011 12:34 PM
To: java-user@lucene.apache.org
Subject: Obtaining IDF values for the terms in a document set

We have a large set of documents that we would like to index with a customized stopword list. We have run tests by indexing a random set of about 10% of the documents, and we'd like to generate a list of the terms in that smaller set and their IDF values as a way to create a starter set of stopwords for the larger document set by selecting the terms that have the lowest IDF values.

First of all, is this the best way to create a stopword list? Second, is there a straightforward way to generate a list of terms and their IDF values from a Lucene index?

Thanks,
Mike