Hi, I am trying to get the most frequently occurring words by building an in-memory index with Lucene, using the code below, but I am not getting the desired results. The text contains 'freedom' three times, but the output reports a count of only 1. Where am I making a mistake, and is there a way to fix it? Please help.

    RAMDirectory idx = new RAMDirectory(); // create RAM directory
    IndexWriter writer = new IndexWriter(idx,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.LIMITED); // create the index
    writer.addDocument(createDocument("key1",
        "It behooves every man to freedom freedom freedom remember that the work of the ")); // add text to document
    try {
        computeTopTermQuery(idx); // compute the top terms
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

The computeTopTermQuery method is taken from Sujit Pal's blog: http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html

    private static Query computeTopTermQuery(Directory ramdir) throws Exception {
        final Map<String,Integer> frequencyMap = new HashMap<String,Integer>();
        List<String> termlist = new ArrayList<String>();
        IndexReader reader = IndexReader.open(ramdir);
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term term = terms.term();
            String termText = term.text();
            int frequency = reader.docFreq(term);
            frequencyMap.put(termText, frequency);
            termlist.add(termText);
        }
        reader.close();
        // sort the term map by frequency descending
        Collections.sort(termlist, new ReverseComparator<String>(
            new ByValueComparator<String,Integer>(frequencyMap)));
        // retrieve the top terms based on topTermCutoff
        List<String> topTerms = new ArrayList<String>();
        float topFreq = -1.0F;
        for (String term : termlist) {
            if (topFreq < 0.0F) {
                // first term, capture the value
                topFreq = (float) frequencyMap.get(term);
                topTerms.add(term);
            } else {
                // not the first term, compute the ratio and discard if below
                // topTermCutoff score
                float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
                if (ratio >= topTermCutoff) {
                    topTerms.add(term);
                } else {
                    break;
                }
            }
        }
        StringBuilder termBuf = new StringBuilder();
        BooleanQuery q = new BooleanQuery();
        for (String topTerm : topTerms) {
            termBuf.append(topTerm).
                append("(").
                append(frequencyMap.get(topTerm)).
                append(");");
            q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
        }
        System.out.println(">>> top terms: " + termBuf.toString());
        System.out.println(">>> query: " + q.toString());
        return q;
    }

But surprisingly, I am getting freedom as (1) and not (3), where 3 is the number of occurrences of freedom in the text:

    top terms: accomplished(1);altogether(1);behooves(1);critic(1);does(1);end(1);
    every(1);freedom(1);importance(1);man(1);progress(1);remember(1);
    secondary(1);things(1);who(1);work(1);

Thanks
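
A possible explanation for the count of 1, sketched without Lucene: `reader.docFreq(term)` reports how many documents contain a term, not how often the term occurs, so with a single indexed document every term comes out as 1. The plain-Java sketch below only illustrates the two kinds of count. The class `DocFreqVsTermFreq`, its methods, and the simple tokenizer are hypothetical stand-ins for illustration, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocFreqVsTermFreq {

    // Hypothetical stand-in for an analyzer: lowercase, split on non-letters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Document frequency: number of documents containing the term at least once.
    public static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (tokenize(doc).contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Total term frequency: occurrences of the term summed over all documents.
    public static int totalTermFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            for (String tok : tokenize(doc)) {
                if (tok.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A single document, as in the question.
        List<String> docs = Arrays.asList(
            "It behooves every man to freedom freedom freedom remember that the work of the ");
        System.out.println("docFreq(freedom) = " + docFreq(docs, "freedom"));
        System.out.println("totalTermFreq(freedom) = " + totalTermFreq(docs, "freedom"));
    }
}
```

With one document, the first count is 1 and the second is 3 for 'freedom'. In Lucene 3.x the per-document count should, to my knowledge, be available through `TermDocs.freq()` while iterating `reader.termDocs(term)`; summing those values in place of `docFreq` is one way the loop could be adapted, though this has not been verified against the exact setup in the question.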