[Nutch-dev] Re: IndexSorter optimizer

Doug Cutting Mon, 02 Jan 2006 15:34:10 -0800

Andrzej Bialecki wrote:

>> Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?
> I don't think it's that simple, the OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but to a human judgement.

If we think that high-OPIC is more valuable than high-content-tf, then we should use different functions to damp these. Currently both are damped with sqrt().
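
For illustration only, here is a minimal standalone sketch of how the two damping functions compare as the term frequency grows. It does not use the Lucene or Nutch classes; the class name DampingSketch, the method names, and the 1 + ln(freq) form of the log variant are assumptions made for this example, not what NutchSimilarity does.

```java
public class DampingSketch {

    /** Square-root damping, the shape currently applied per the message above. */
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /**
     * Hypothetical log damping: 1 + ln(freq) for freq > 0, else 0.
     * The "+ 1" keeps a single occurrence at weight 1, matching sqrtTf(1).
     */
    static float logTf(float freq) {
        return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }

    public static void main(String[] args) {
        // Print both damped values side by side for a range of frequencies.
        for (float freq : new float[] {1, 10, 100, 1000}) {
            System.out.printf("freq=%.0f sqrt=%.2f log=%.2f%n",
                    freq, sqrtTf(freq), logTf(freq));
        }
    }
}
```

For large frequencies the log variant grows much more slowly than sqrt, so switching only the content field to it would reduce the weight of very high content tf relative to a sqrt-damped score.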

>> I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java or would you?
> Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which for indexes with deleted docs (after dedup) leads to exceptions.


I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of maxDoc().
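
For readers unfamiliar with the distinction: numDocs() counts only live documents, while maxDoc() is one greater than the largest document number, so the two differ once documents have been deleted. The sketch below uses a toy stand-in reader (ToyReader is hypothetical, not the Lucene IndexReader) to show why an array sized by numDocs() but indexed by document number can overflow. It is not the IndexSorter code.

```java
import java.util.BitSet;

public class MaxDocSketch {

    /** Toy stand-in for an index reader; hypothetical, not the Lucene class. */
    static class ToyReader {
        private final int maxDoc;
        private final BitSet deleted = new BitSet();

        ToyReader(int maxDoc) { this.maxDoc = maxDoc; }

        void delete(int doc) { deleted.set(doc); }
        boolean isDeleted(int doc) { return deleted.get(doc); }

        /** One greater than the largest document number, deletions included. */
        int maxDoc() { return maxDoc; }

        /** Number of live (non-deleted) documents. */
        int numDocs() { return maxDoc - deleted.cardinality(); }
    }

    /** Correct pattern: size per-document arrays by maxDoc(). */
    static float[] scoresSizedByMaxDoc(ToyReader reader) {
        float[] scores = new float[reader.maxDoc()];
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (!reader.isDeleted(doc)) scores[doc] = 1.0f;
        }
        return scores;
    }

    /** Buggy pattern: array sized by numDocs() but indexed by document number. */
    static boolean sizedByNumDocsThrows(ToyReader reader) {
        float[] scores = new float[reader.numDocs()];
        try {
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (!reader.isDeleted(doc)) scores[doc] = 1.0f;
            }
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader(5);
        reader.delete(1); // simulate a document removed by dedup
        System.out.println("numDocs=" + reader.numDocs() + " maxDoc=" + reader.maxDoc());
        System.out.println("sized by numDocs throws: " + sizedByNumDocsThrows(reader));
        System.out.println("sized by maxDoc length: " + scoresSizedByMaxDoc(reader).length);
    }
}
```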

Cheers,

Doug

