I would like to suggest a tip:
- Download Luke from http://www.getopt.org/luke.
- Open a segment index in it.
- Select Overview.
- Use the 'top ranking terms' in common-terms.utf8.
Yes, this is a good idea.
Instead of Luke, one can use the following command to generate this file:
bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
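
For readers unfamiliar with what a "top ranking terms" list is, here is a minimal Python sketch of the idea: count how often each term occurs and keep the most frequent ones. It is only an illustration under simple assumptions (whitespace tokenization, lowercasing, a hypothetical `top_terms` helper), not how Luke or HighFreqTerms work internally, and it does not read a Lucene index.

```python
from collections import Counter


def top_terms(documents, count=10):
    """Return the `count` most frequent terms across `documents`.

    Hypothetical helper for illustration only: terms are lowercased and
    split on whitespace, which is far simpler than a real analyzer.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    # most_common sorts by descending frequency.
    return [term for term, _ in counts.most_common(count)]


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the end"]
    for term in top_terms(sample, count=3):
        print(term)
```

Terms that rank highest in such a list are the candidates one would consider for common-terms.utf8.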
Doug
