Hi,

I am trying to find the most frequent words in a text by building an
in-memory index with Lucene, using the code below, but I am not getting the
expected results. The text contains 'freedom' three times, but the reported
count is only 1. Where am I making a mistake, and is there a way to fix it?
Please help.

RAMDirectory idx = new RAMDirectory(); // create RAM directory
// create the index
IndexWriter writer = new IndexWriter(idx,
    new StandardAnalyzer(Version.LUCENE_30),
    IndexWriter.MaxFieldLength.LIMITED);

// add text to document
writer.addDocument(createDocument("key1",
    "It behooves every man to freedom freedom freedom remember that the work of the "));
// close the writer so the added document is committed and visible to a reader
writer.close();

try {
    computeTopTermQuery(idx); // compute the top terms
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

The computeTopTermQuery method is taken from this post on Sujit Pal's blog:
http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html

      private static Query computeTopTermQuery(Directory ramdir) throws Exception {
        final Map<String,Integer> frequencyMap =
          new HashMap<String,Integer>();
        List<String> termlist = new ArrayList<String>();
        IndexReader reader = IndexReader.open(ramdir);
        TermEnum terms = reader.terms();
        while (terms.next()) {
          Term term = terms.term();
          String termText = term.text();
          int frequency = reader.docFreq(term);
          frequencyMap.put(termText, frequency);
          termlist.add(termText);
        }
        reader.close();
        // sort the term map by frequency descending
        Collections.sort(termlist, new ReverseComparator<String>(
          new ByValueComparator<String,Integer>(frequencyMap)));
        // retrieve the top terms based on topTermCutoff
        List<String> topTerms = new ArrayList<String>();
        float topFreq = -1.0F;
        for (String term : termlist) {
          if (topFreq < 0.0F) {
            // first term, capture the value
            topFreq = (float) frequencyMap.get(term);
            topTerms.add(term);
          } else {
            // not the first term, compute the ratio and discard if below
            // topTermCutoff score
            float ratio = (float) frequencyMap.get(term) / topFreq;
            if (ratio >= topTermCutoff) {
              topTerms.add(term);
            } else {
              break;
            }
          }
        }
        StringBuilder termBuf = new StringBuilder();
        BooleanQuery q = new BooleanQuery();
        for (String topTerm : topTerms) {
          termBuf.append(topTerm).
            append("(").
            append(frequencyMap.get(topTerm)).
            append(");");
          q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
        }
        System.out.println(">>> top terms: " + termBuf.toString());
        System.out.println(">>> query: " + q.toString());
        return q;
      }


Surprisingly, freedom is reported as (1) rather than (3), even though it
occurs three times in the text.

top terms:
accomplished(1);altogether(1);behooves(1);critic(1);does(1);end(1);
every(1);freedom(1);importance(1);man(1);progress(1);remember(1);
secondary(1);things(1);who(1);work(1);

Thanks
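
A possible explanation, offered as a sketch rather than something verified
against this exact setup: reader.docFreq(term) returns the number of
documents that contain the term, not the number of times the term occurs.
With a single document in the index, every term therefore gets a count of 1.
The plain-Java example below does not use Lucene at all; the class
FrequencyDemo, its methods, and the simple tokenizer are hypothetical names
made up for illustration. It only shows the difference between the two
counts on the same sample text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class FrequencyDemo {

    // Hypothetical stand-in for an analyzer: lowercase, split on non-letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Document frequency: how many documents contain the term at least once.
    static Map<String, Integer> docFrequency(List<String> docs) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (String doc : docs) {
            Set<String> distinct = new HashSet<String>(tokenize(doc));
            for (String term : distinct) {
                Integer old = df.get(term);
                df.put(term, old == null ? 1 : old + 1);
            }
        }
        return df;
    }

    // Total term frequency: how many times the term occurs across all documents.
    static Map<String, Integer> termFrequency(List<String> docs) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String doc : docs) {
            for (String term : tokenize(doc)) {
                Integer old = tf.get(term);
                tf.put(term, old == null ? 1 : old + 1);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "It behooves every man to freedom freedom freedom remember that the work of the ");
        System.out.println("docFreq(freedom)  = " + docFrequency(docs).get("freedom"));  // 1
        System.out.println("termFreq(freedom) = " + termFrequency(docs).get("freedom")); // 3
    }
}
```

If this is indeed the cause, then in Lucene 3.0 the per-document occurrence
count should be available through reader.termDocs(term): iterate with
next() and sum freq() over the matching documents, then store that sum in
frequencyMap instead of the docFreq value.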
