Dear Lucene Users: What is the best way to get the most common terms for a subset of the total documents in your index?
I know how to get the most common terms for a field for the entire index, but what is the most efficient way to do this for a subset of documents? Here is the code I am using to get the top "numberOfTerms" common terms for the field "fieldName": public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms) { //make sure min will get a positive number if (numberOfTerms < 1) { numberOfTerms = Integer.MAX_VALUE; } numberOfTerms = Math.min(numberOfTerms, 50); //String[] commonTerms = new String[numberOfTerms]; try { IndexReader reader = IndexReader.open(indexPath); TermInfoQueue tiq = new TermInfoQueue(numberOfTerms); TermEnum terms = reader.terms(); int minFreq = 0; while (terms.next()) { if(fieldName.equalsIgnoreCase(terms.term().field())) { if (terms.docFreq() > minFreq) { tiq.put(new TermInfo(terms.term(), terms.docFreq())); if (tiq.size() >= numberOfTerms) // if tiq overfull { tiq.pop(); // remove lowest in tiq minFreq = ((TermInfo) tiq.top()).docFreq; // reset // minFreq } } } } TermInfo[] res = new TermInfo[tiq.size()]; for (int i = 0; i < res.length; i++) { res[res.length - i - 1] = (TermInfo) tiq.pop(); } reader.close(); return res; } catch (IOException ioe) { logger.error("IOException: " + ioe.getMessage()); } return null; } --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]