Andrzej Bialecki wrote:
Sounds like tf/idf might be de-emphasized in scoring. Perhaps
NutchSimilarity.tf() should use log() instead of sqrt() when
field==content?
I don't think it's that simple. The OPIC score is what determined this
behaviour, and it corresponds not to tf/idf at all but to a human
judgement.
If we think that high-OPIC is more valuable than high-content-tf, then
we should use different functions to damp these. Currently both are
damped with sqrt().
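
To make the difference between the two damping functions concrete, here is a
minimal sketch in plain Java. It does not touch NutchSimilarity or any Lucene
class; the class and method names are hypothetical, and the 1 + freq offset
inside log() is my assumption, chosen so that a frequency of zero maps to zero.

```java
// Hypothetical sketch: compares sqrt() and log() damping of a raw frequency.
// Not the Nutch or Lucene API; names are illustrative only.
final class DampingSketch {

    /** Square-root damping, the function both values currently share. */
    static double sqrtDamp(double freq) {
        return Math.sqrt(freq);
    }

    /** Logarithmic damping; the 1 + freq offset (an assumption) keeps freq = 0 at 0. */
    static double logDamp(double freq) {
        return Math.log(1.0 + freq);
    }

    public static void main(String[] args) {
        // Print both damped values side by side for growing raw frequencies.
        for (double freq : new double[] {1, 10, 100, 1000}) {
            System.out.printf("freq=%6.0f  sqrt=%7.3f  log=%6.3f%n",
                    freq, sqrtDamp(freq), logDamp(freq));
        }
    }
}
```

log() grows far more slowly than sqrt() for large inputs, so a value damped
with log() would count for less at the high end than one damped with sqrt().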
I've updated the version of Lucene included with Nutch to have the
required patch. Would you like me to commit IndexSorter.java or would
you?
Please do it. There are two typos in your version of IndexSorter: you
used numDocs() in two places instead of maxDoc(), which leads to
exceptions for indexes with deleted docs (after dedup).
I have committed this, along with the LuceneQueryOptimizer changes.
I could only find one place where I was using numDocs() instead of maxDoc().
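
For anyone following along, the numDocs()/maxDoc() distinction can be sketched
with a toy stand-in for an index reader. This is not the IndexSorter code and
does not use Lucene; ToyIndex and liveDocs() are hypothetical, and only the
method names numDocs(), maxDoc() and isDeleted() are borrowed from the
discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index reader with deletions; not the Lucene API.
final class ToyIndex {
    private final boolean[] deleted;

    ToyIndex(boolean[] deleted) {
        this.deleted = deleted.clone();
    }

    /** One more than the largest document number, deleted docs included. */
    int maxDoc() {
        return deleted.length;
    }

    /** Number of live (non-deleted) documents. */
    int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) {
                live++;
            }
        }
        return live;
    }

    boolean isDeleted(int doc) {
        return deleted[doc];
    }

    /** Collects live document numbers by scanning the full range up to maxDoc(). */
    List<Integer> liveDocs() {
        List<Integer> out = new ArrayList<>();
        for (int doc = 0; doc < maxDoc(); doc++) {
            if (!isDeleted(doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Six document slots, two of them deleted (as after dedup).
        ToyIndex index = new ToyIndex(new boolean[] {false, true, false, false, true, false});
        System.out.println("numDocs=" + index.numDocs() + " maxDoc=" + index.maxDoc());
        System.out.println("live docs: " + index.liveDocs());
    }
}
```

Once documents have been deleted, numDocs() is smaller than maxDoc(), so an
array sized with numDocs() and indexed by document number can overflow, and a
loop bounded by numDocs() stops before the last documents.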
Cheers,
Doug
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers