Hello,
Given a query, I want to get, for each query term, the number of times that
term occurs in a document. The code at the end of this message is what I have
tried, and it does not give reliable results. It works fine with exact
matching, but as soon as stemming kicks in, the counts are off. For example,
if the query is "refer", the tokens "refer", "referred", etc. are matched, but
the actual number of occurrences in the document differs from what
getTermFrequencies() reports.
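To make the discrepancy concrete, here is a small self-contained sketch. It does not use Lucene and is separate from my actual code, which is at the end of this message. The toyStem method is a hypothetical stand-in for whatever stemmer the analyzer applies, not Lucene's stemmer. My guess, which I have not confirmed, is that the term vector stores stemmed tokens, so the frequency recorded for "refer" would cover every surface form that stems to it:

```java
import java.util.Locale;

public class StemCountSketch {

    // Hypothetical toy stemmer (NOT Lucene's): strips one common suffix,
    // then collapses a doubled final letter ("referr" -> "refer").
    static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (t.endsWith(suffix) && t.length() > suffix.length() + 2) {
                t = t.substring(0, t.length() - suffix.length());
                break;
            }
        }
        int n = t.length();
        if (n >= 2 && t.charAt(n - 1) == t.charAt(n - 2)) {
            t = t.substring(0, n - 1);
        }
        return t;
    }

    // Counts tokens that are literally equal to the given word.
    static int countLiteral(String text, String word) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(word)) {
                count++;
            }
        }
        return count;
    }

    // Counts tokens whose stem equals the given (already stemmed) term,
    // which is what a frequency over stemmed tokens would report.
    static int countStemmed(String text, String stemmedTerm) {
        int count = 0;
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty() && toyStem(token).equals(stemmedTerm)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "I refer to it; she referred to it; they are referring again";
        System.out.println("literal: " + countLiteral(text, "refer")); // literal: 1
        System.out.println("stemmed: " + countStemmed(text, "refer")); // stemmed: 3
    }
}
```

If that guess is right, the two numbers answer different questions: the first counts the exact word, the second counts everything the stemmer maps to the same term.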
Any ideas, anyone? Can this be written in a simpler and/or more efficient way?
Thanks.
int totalOccurrences = 0;
reader = IndexReader.open(getDirectory(indexDirPath));
HashSet terms = new HashSet();
query.extractTerms(terms);
TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
if (tfvs != null) {
    // For each term frequency vector (i.e. for each field)
    for (int i = 0; i < tfvs.length; i++) {
        String field = tfvs[i].getField();
        String[] strTerms = tfvs[i].getTerms();
        int[] tfs = tfvs[i].getTermFrequencies();
        if (strTerms != null) {
            // For each term in the query
            for (Iterator iter = terms.iterator(); iter.hasNext();) {
                Term term = (Term) iter.next();
                // For each term in the vector
                for (int j = 0; j < strTerms.length; j++) {
                    // If found the query term among the vector terms
                    if (field.equals(term.field()) &&
                            strTerms[j].equals(term.text())) {
                        // Add the term frequency to the total
                        totalOccurrences += tfs[j];
                    }
                }
            }
        }
    }
}
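
One possible simplification, sketched without Lucene so it can be run on its own: replace the innermost loop over the vector terms with a direct lookup. The sketch assumes the term array is sorted in ascending order and runs parallel to the frequency array, which is how I read the getTerms() and getTermFrequencies() documentation, but I have not verified it. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class TermLookupSketch {

    // Returns the frequency stored for the term, or 0 if it is absent.
    // ASSUMPTION: sortedTerms is in ascending String order and freqs[i]
    // is the frequency of sortedTerms[i].
    static int frequencyOf(String[] sortedTerms, int[] freqs, String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? freqs[idx] : 0;
    }

    // Sums the stored frequencies of all query terms for one field.
    static int totalOccurrences(String[] sortedTerms, int[] freqs,
                                List<String> queryTerms) {
        int total = 0;
        for (String term : queryTerms) {
            total += frequencyOf(sortedTerms, freqs, term);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] terms = {"index", "refer", "search"};
        int[] freqs = {2, 3, 1};
        List<String> query = Arrays.asList("refer", "search", "nope");
        System.out.println(totalOccurrences(terms, freqs, query)); // 4
    }
}
```

This turns the per-term scan into a binary search. If I read the Javadoc correctly, TermFreqVector also has an indexOf(String) method that could serve the same purpose in the real code, and the field comparison could be hoisted out of the inner loop.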