Hi,
I am trying to find the most frequent words in a text by building an
in-memory index with Lucene, using the code below, but I am not getting the
expected result. The text contains 'freedom' three times, yet the reported
count is only 1. Where is my mistake, and is there a way to fix it? Please
help.
RAMDirectory idx = new RAMDirectory(); // create RAM directory
IndexWriter writer = new IndexWriter(idx,
    new StandardAnalyzer(Version.LUCENE_30),
    IndexWriter.MaxFieldLength.LIMITED);
// create the index
writer.addDocument(createDocument("key1",
    "It behooves every man to freedom freedom freedom remember "
    + "that the work of the ")); // add text to document
try {
    computeTopTermQuery(idx); // compute the top term
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
The computeTopTermQuery method is taken from Sujit Pal's blog:
http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html
private static Query computeTopTermQuery(Directory ramdir)
        throws Exception {
    final Map<String,Integer> frequencyMap =
        new HashMap<String,Integer>();
    List<String> termlist = new ArrayList<String>();
    IndexReader reader = IndexReader.open(ramdir);
    TermEnum terms = reader.terms();
    while (terms.next()) {
        Term term = terms.term();
        String termText = term.text();
        int frequency = reader.docFreq(term);
        frequencyMap.put(termText, frequency);
        termlist.add(termText);
    }
    reader.close();
    // sort the term map by frequency descending
    Collections.sort(termlist, new ReverseComparator<String>(
        new ByValueComparator<String,Integer>(frequencyMap)));
    // retrieve the top terms based on topTermCutoff
    List<String> topTerms = new ArrayList<String>();
    float topFreq = -1.0F;
    for (String term : termlist) {
        if (topFreq < 0.0F) {
            // first term, capture the value
            topFreq = (float) frequencyMap.get(term);
            topTerms.add(term);
        } else {
            // not the first term, compute the ratio and discard if below
            // topTermCutoff score
            float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
            if (ratio >= topTermCutoff) {
                topTerms.add(term);
            } else {
                break;
            }
        }
    }
    StringBuilder termBuf = new StringBuilder();
    BooleanQuery q = new BooleanQuery();
    for (String topTerm : topTerms) {
        termBuf.append(topTerm).
            append("(").
            append(frequencyMap.get(topTerm)).
            append(");");
        q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
    }
    System.out.println(">>> top terms: " + termBuf.toString());
    System.out.println(">>> query: " + q.toString());
    return q;
}
But surprisingly, I get freedom(1) instead of freedom(3), where 3 is the
number of times freedom occurs in the text.
top terms:
accomplished(1);altogether(1);behooves(1);critic(1);does(1);end(1);
every(1);freedom(1);importance(1);man(1);progress(1);remember(1);
secondary(1);things(1);who(1);work(1);
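
A possible explanation, offered as a guess rather than a confirmed diagnosis
of this setup: reader.docFreq(term) returns the number of documents that
contain the term, not the number of times the term occurs. With a single
document in the index, every term would then be reported as 1. The plain-Java
sketch below does not use Lucene; the class FrequencyDemo, its method names,
and the simple letters-only tokenizer are all hypothetical and serve only to
illustrate the difference between the two counts.

```java
import java.util.Arrays;
import java.util.List;

public class FrequencyDemo {

    // Hypothetical tokenizer: lowercase, split on anything that is not a-z.
    static String[] tokenize(String doc) {
        return doc.toLowerCase().split("[^a-z]+");
    }

    // Number of documents containing the term at least once
    // (the kind of value a document frequency reports).
    static int documentFrequency(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (Arrays.asList(tokenize(doc)).contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Total number of occurrences of the term across all documents.
    static int totalOccurrences(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "It behooves every man to freedom freedom freedom remember "
            + "that the work of the ");
        System.out.println("document frequency: "
            + documentFrequency(docs, "freedom"));
        System.out.println("total occurrences:  "
            + totalOccurrences(docs, "freedom"));
    }
}
```

If this is indeed the cause, then in Lucene 3.0 the per-document occurrence
count should be obtainable by iterating reader.termDocs(term) and summing
TermDocs.freq() in place of the docFreq call; I have not tested that against
the code above.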
Thanks