Thanks Mark,

the call reader.docFreq(categoryTerm) is certainly a good way to get the numerator of the IDF formula (http://en.wikipedia.org/wiki/Tf%E2%80%93idf#Mathematical_details).
However, what is still missing is the denominator. For this I need the number of in-category documents that each term appears in (again, categories are in a separate field). Calling reader.docFreq(term) would give me the document frequency in the complete collection, but I only want the number of documents within the category that contain the term.

So for a query

    +CATEGORY:sport TEXT:Johnson

I would like to set the IDF to

    log( (number of all sports documents) / (number of all sports documents that contain Johnson) )

Is there an efficient way of doing this?

Cheers,
Max

On Mon, Oct 18, 2010 at 2:32 PM, mark harwood <markharw...@yahoo.co.uk> wrote:
> Can you not just call reader.docFreq(categoryTerm) ?
>
> The returned figure includes deleted docs but then the search term uses this
> method too so should suffer from the same inaccuracy.
>
> Cheers
> Mark
>
> ----- Original Message ----
> From: Max Jakob <max.ja...@fu-berlin.de>
> To: java-user@lucene.apache.org
> Sent: Mon, 18 October, 2010 12:26:33
> Subject: Consider only documents of a category for IDF
>
> Hi,
>
> I would like to change the IDF value of the Lucene similarity
> computation to "inverse document frequency inside category". Not the
> complete collection should be considered, but only the documents that
> have a certain category. The categories are stored as separate fields.
>
> The implementation below works, but it is kind of slow. I was
> wondering if there is a more efficient way than to read the DocIdSet
> from the index for each term.
>
> Thanks in advance for any pointers you might have!
> Regards,
> Max
>
> public class InCategorySimilarity extends DefaultSimilarity {
>
>     public InCategorySimilarity() {}
>
>     // These objects have to be here so that they are visible across multiple executions of idfExplain
>     OpenBitSet categoryIdSet;
>     long catDocs = 1;
>
>     @Override
>     public Explanation.IDFExplanation idfExplain(final Term term, final Searcher searcher) throws IOException {
>         return new Explanation.IDFExplanation() {
>             long termCategoryFreq = 0;
>             boolean isCategoryField = term.field().equals("CATEGORY");
>
>             private long termCategoryFreq() {
>                 try {
>                     IndexReader reader = ((IndexSearcher) searcher).getIndexReader();
>                     TermsFilter filter = new TermsFilter();
>                     filter.addTerm(term);
>                     OpenBitSet docSet = (OpenBitSet) filter.getDocIdSet(reader);
>
>                     if (isCategoryField) {
>                         categoryIdSet = docSet;
>                         catDocs = categoryIdSet.cardinality();
>                     } else {
>                         docSet.and(categoryIdSet);
>                     }
>                     termCategoryFreq = docSet.cardinality();
>                 } catch (IOException e) {
>                     //handle
>                 }
>                 return termCategoryFreq;
>             }
>
>             public float invCatFreq(long termCategoryFreq, long catDocs) {
>                 return termCategoryFreq==0 ? 0 : (float) (Math.log(new Float(catDocs) / new Float(termCategoryFreq)) + 1.0);
>             }
>
>             @Override
>             public float getIdf() {
>                 termCategoryFreq = termCategoryFreq();
>                 float invCatFreq = invCatFreq(termCategoryFreq, catDocs);
>                 return invCatFreq;
>             }
>         };
>     }
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
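
To make the arithmetic of the in-category IDF concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not answer whether Lucene offers a faster built-in route. It only assumes that, for each term, a bit set of matching document ids is available (bit i set when document i contains the term). The class InCategoryIdf and its methods are hypothetical names made up for this illustration. The category bit set is stored once per category, so only the term's bit set has to be obtained per query term.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch (hypothetical class, not part of Lucene):
 * in-category IDF computed from per-term document bit sets.
 */
final class InCategoryIdf {

    // Category bit sets keyed by category name, stored once and reused
    // for every query term instead of being rebuilt per term.
    private final Map<String, BitSet> categoryDocs = new HashMap<>();

    /** Registers the documents of a category; the set is copied defensively. */
    void putCategory(String category, BitSet docs) {
        categoryDocs.put(category, (BitSet) docs.clone());
    }

    /** Number of documents in the category (the numerator of the IDF ratio). */
    long categorySize(String category) {
        return lookup(category).cardinality();
    }

    /** Number of category documents that contain the term (the denominator). */
    long inCategoryDocFreq(String category, BitSet termDocs) {
        BitSet both = (BitSet) termDocs.clone(); // leave the caller's set untouched
        both.and(lookup(category));              // keep only in-category documents
        return both.cardinality();
    }

    /**
     * log(catDocs / termCategoryFreq) + 1, mirroring invCatFreq in the quoted
     * code; returns 0 when the term does not occur in the category at all.
     */
    static float idf(long catDocs, long termCategoryFreq) {
        if (termCategoryFreq == 0) {
            return 0f;
        }
        return (float) (Math.log((double) catDocs / termCategoryFreq) + 1.0);
    }

    private BitSet lookup(String category) {
        BitSet docs = categoryDocs.get(category);
        if (docs == null) {
            throw new IllegalArgumentException("unknown category: " + category);
        }
        return docs;
    }
}
```

For the query +CATEGORY:sport TEXT:Johnson one would register the "sport" bit set once, then call inCategoryDocFreq("sport", johnsonDocs) and pass both counts to idf(...). Unlike the quoted implementation, this does not depend on the category term being processed before the text terms, because the category set is looked up by name.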