Andrzej Bialecki wrote:
Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?

I don't think it's that simple. The OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but to a human judgement.

If we think that high-OPIC is more valuable than high-content-tf, then we should use different functions to damp these. Currently both are damped with sqrt().
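To make the difference between the two damping functions concrete, here is a minimal standalone sketch. The class and method names are hypothetical and this is not Nutch or Lucene code; the 1 + ln(x) form is just one assumed way to write a log damping so that a value of 1 still maps to 1.

```java
public class DampingSketch {
    // Square-root damping, the function the thread says is used today.
    public static double sqrtDamp(double x) {
        return Math.sqrt(x);
    }

    // Hypothetical logarithmic alternative: 1 + ln(x) for x > 0, else 0.
    // (Assumed form, chosen so that logDamp(1) == sqrtDamp(1) == 1.)
    public static double logDamp(double x) {
        return x > 0 ? 1.0 + Math.log(x) : 0.0;
    }

    public static void main(String[] args) {
        // Print both damped values for a few raw term frequencies.
        for (double tf : new double[] {1, 10, 100, 1000}) {
            System.out.printf("tf=%6.0f  sqrt=%7.3f  log=%6.3f%n",
                    tf, sqrtDamp(tf), logDamp(tf));
        }
    }
}
```

For large inputs the log form grows much more slowly than sqrt, so applying it to content tf alone would shrink the influence of very frequent terms relative to a sqrt-damped OPIC score.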

I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java or would you?

Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which for indexes with deleted docs (after dedup) leads to exceptions.

I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of maxDoc().
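For readers unfamiliar with the distinction: numDocs() counts only live documents, while maxDoc() is one greater than the largest document number, so the two differ once documents have been deleted. The sketch below is an illustration only, using a toy boolean array in place of a real index; it is not the IndexSorter code, and the class and its helper fits() are hypothetical.

```java
public class DocCountSketch {
    // Toy stand-in for an index: deleted[i] is true when document i is deleted.
    public static int maxDoc(boolean[] deleted) {
        return deleted.length;
    }

    // Number of documents that are not deleted.
    public static int numDocs(boolean[] deleted) {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        return live;
    }

    // True when every live document id is a valid index into an array of the given size.
    public static boolean fits(boolean[] deleted, int size) {
        for (int doc = 0; doc < deleted.length; doc++) {
            if (!deleted[doc] && doc >= size) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Five document slots, two of them deleted (e.g. after dedup).
        boolean[] deleted = {false, true, true, false, false};
        System.out.println("maxDoc=" + maxDoc(deleted) + " numDocs=" + numDocs(deleted));
        System.out.println("sized by numDocs fits: " + fits(deleted, numDocs(deleted)));
        System.out.println("sized by maxDoc fits:  " + fits(deleted, maxDoc(deleted)));
    }
}
```

In this toy case a per-document array sized by numDocs() has three slots, but live document ids go up to 4, which is the kind of mismatch that would surface as an out-of-bounds exception.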

Cheers,

Doug
