Andrzej Bialecki wrote:
Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?

I don't think it's that simple. The OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but to a human judgement.

If we think that high-OPIC is more valuable than high-content-tf, then we should use different functions to damp these. Currently both are damped with sqrt().
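
To make the difference between the two damping shapes concrete, here is a minimal sketch that compares sqrt() with one possible logarithmic alternative. The class and method names are hypothetical, and the 1 + ln(freq) form is only an assumption about what "use log()" might look like; it is not the actual NutchSimilarity code.

```java
public final class DampingSketch {
    // Square-root damping, the shape the thread says is currently used.
    static float sqrtDamp(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical logarithmic damping: 1 + ln(freq) for freq > 0, else 0.
    // The "+ 1" keeps a single occurrence from scoring zero (an assumption).
    static float logDamp(float freq) {
        return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }

    public static void main(String[] args) {
        // Print both damped values side by side for growing raw frequencies.
        for (float freq : new float[] {1f, 10f, 100f, 1000f}) {
            System.out.printf("freq=%6.0f  sqrt=%7.3f  log=%7.3f%n",
                    freq, sqrtDamp(freq), logDamp(freq));
        }
    }
}
```

The log form flattens out much faster than sqrt() as the raw value grows, so applying it to content tf only, while leaving the OPIC-derived boost under sqrt(), would shift relative weight toward OPIC.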

I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java or would you?

Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which leads to exceptions for indexes with deleted docs (after dedup).
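
For illustration, here is one way this kind of mix-up can surface, sketched with a toy stand-in rather than the real Lucene IndexReader. The class DocCountSketch and the helper fillPerDocTable are hypothetical names; the only assumption carried over from Lucene is that numDocs() counts live documents while maxDoc() is one more than the largest document number.

```java
public final class DocCountSketch {
    // Toy stand-in for an index: deleted[i] marks document number i as deleted.
    private final boolean[] deleted;

    DocCountSketch(boolean[] deleted) {
        this.deleted = deleted.clone();
    }

    // One more than the largest document number, deleted or not.
    int maxDoc() {
        return deleted.length;
    }

    // Number of documents that are not deleted.
    int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        return live;
    }

    boolean isDeleted(int doc) {
        return deleted[doc];
    }

    // Fills a per-document table of the given capacity, indexed by document
    // number. Returns false if a live document number falls outside the table.
    static boolean fillPerDocTable(DocCountSketch index, int capacity) {
        float[] table = new float[capacity];
        try {
            for (int doc = 0; doc < index.maxDoc(); doc++) {
                if (!index.isDeleted(doc)) {
                    table[doc] = 1.0f;
                }
            }
            return true;
        } catch (ArrayIndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Five document slots, two of them deleted (as after dedup).
        DocCountSketch index =
                new DocCountSketch(new boolean[] {false, true, false, true, false});
        System.out.println("sized by numDocs(): " + fillPerDocTable(index, index.numDocs()));
        System.out.println("sized by maxDoc():  " + fillPerDocTable(index, index.maxDoc()));
    }
}
```

Without deletions the two counts are equal, which is why the problem only shows up on indexes that have been through dedup.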

I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of maxDoc().

Cheers,

Doug


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers